SBNR DIGITAL SIGNAL PROCESSOR ARCHITECTURE

1.0 Scope of Work

"There  is  a  great  deal  of  innovation  for  new  complex  special-purpose 
signal processing integrated circuits... often yielding well over a factor of a
thousand improvement over even the fastest general-purpose machines,"
Jonathan Allen, Fellow IEEE, in "Computer Architecture for Digital Signal
Processing," Proc. IEEE, Volume 73, Number 5, pp. 852-973, May 1985.

The following research was proposed for a three-year period. It
comprises significantly distinct efforts which complement the existing
efforts in current adaptive signal processor architecture research. Briefly,
these tasks comprised a study of non-conventional number system
implementations focusing on VLSI enhancements attributable to redundant number
systems. This increased practical knowledge should add impetus to many
potential signal processing tasks (target trackers, beamformers,
communication receivers, spread spectrum). Three diverse, yet tightly
coupled, research topics were posed, centering on the usage of Signed-Digit
(SD) arithmetic to solve mult/acc intensive signal processing tasks (streaming
data). Efficient implementations for signed-digit arithmetic were sought for
systolic arrays. Connectivity and control were investigated for inherent
fault-tolerance. Lastly, multiple-valued logic for the Signed Binary Number
Representations (SBNR) was studied for both fault-tolerance and array
regularity. The dominant and focused application of this research was
efficient solutions of specific signal processing algorithms.

2.0  Conventional  Number  Systems  Drawbacks 

The  most  serious  objection  to  using  the  conventional  number 
representations  (the  sign  magnitude,  the  radix  complement  and  the  diminished 
radix  complement  representations)  for  a  signal  processor  cell  is  that  addition 
in  these  representations  cannot  be  truly  parallel.  A  computing  cell  designed 
for  such  representations  cannot  be  easily  connected  to  run  in  parallel  with 
identical  cells  in  such  a  way  that  the  microsteps  involving  additions  can  be 
carried out in time independent of the number of cells. For signed-digit
representations,  the  number  of  cells  and  the  precision  of  the  operands  will 
not  affect  the  time  of  such  microsteps.  The  time  needed  will  only  depend  on 
the  structure  of  an  adder  position. 

Another  convenience  in  designing  arithmetic  modules  with  signed  digits 
is  that  no  special  treatment  is  required  for  the  most-significant  position. 
For  radix  complement  or  diminished  radix  complement  notation,  special 
attention  is  needed  to  handle  the  arithmetic  shifts,  the  sign  of  multipliers 
and/or the end-around carries. For the sign magnitude notation, the sign of
the result of an addition or subtraction requires dedicated circuitry. None of
these problems exists for the signed-digit notation. The shift
input for any arithmetic shift is always zero. The indicator digit (the sign
digit) can be treated just like all other digits.

The serial mode of processing has to proceed from the least-significant
end to the most-significant end if the conventional number representations are
used. The overflow condition or the leading zeros can be detected only after
the last segment of the result has been generated. For the signed-digit
notations, serial operations can proceed from the most-significant end.
Processing may be stopped by an end symbol in the operands such as the space
zero discussed by Avizienis [1]. This can lead to a more efficient serial
processing procedure if the allowed precision is in excess of the needed
precision. Since the most significant digits of the result can be generated
first, the overflow condition or the leading zeros can be detected at the
beginning of an operation. Result digits may be stored away in their final
positions without subsequent corrective shifts, which are not necessarily
trivial in a multi-precision environment.

For the signed-digit notations, the basic arithmetic algorithms for each
digit position are essentially invariant with the position of the operand
digits. Each result digit depends only on the operand digits in a fixed
number of digit positions. Because of this, the detection and the correction
of hardware errors can be implemented independently for each digit position, as
suggested by Avizienis [1]. The "round-off" error resulting from simple
truncation is without bias. For a mantissa of m fractional digits, the
maximum absolute truncation error in the mantissa is less than one mantissa
bit!

Besides  the  conventional  number  representations,  there  are  a  few  other 
novel  number  representations  which  have  advantages  in  special  situations  but 
are  not  suitable  for  this  variable  precision  module.  Examples  are  the  residue 
number  representation  and  the  negative  base  representation.  The  residue 
number  system  developed  from  linear  congruences  does  not  require  carry 
propagation.  The  multiplication  of  two  numbers  needs  as  little  time  as  the 
addition.  The  main  difficulties  of  the  residue  number  representation  pertain 
to  the  determination  of  the  relative  magnitudes  of  the  two  numbers  and  to  the 
division process. In the negative base number system, on the other hand,
division is not easily implemented. The sign of a number in a negative base
depends on whether the most significant digit is in an even or odd position.
This complicates the division process since the signs of the operands and the
signs of the intermediate results are essential in any division algorithm.
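
The trade-off described above for residue numbers can be illustrated with a small sketch. The moduli (7, 11, 13) and all function names are assumptions chosen for illustration; they do not come from this report. Addition and multiplication act on each residue independently, so no carry crosses between channels, but a magnitude comparison first requires decoding, done here with the Chinese Remainder Theorem.

```python
from math import prod

MODULI = (7, 11, 13)  # pairwise coprime, illustrative choice; range is 0..1000


def to_rns(x):
    """Encode a non-negative integer as a tuple of residues."""
    return tuple(x % m for m in MODULI)


def rns_add(a, b):
    """Channel-wise addition: each residue is independent, so no carry propagates."""
    return tuple((ai + bi) % m for ai, bi, m in zip(a, b, MODULI))


def rns_mul(a, b):
    """Channel-wise multiplication: same structure as addition."""
    return tuple((ai * bi) % m for ai, bi, m in zip(a, b, MODULI))


def from_rns(r):
    """Decode with the Chinese Remainder Theorem (needed before comparing magnitudes)."""
    M = prod(MODULI)
    total = 0
    for ri, m in zip(r, MODULI):
        Mi = M // m
        total += ri * Mi * pow(Mi, -1, m)  # modular inverse, Python 3.8+
    return total % M
```

Note that the residue tuples themselves carry no ordering: deciding which of two encoded numbers is larger goes through `from_rns`, which is the difficulty the text refers to.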

In  short,  the  signed-digit  systems  provide  two  dimensions  of  freedom: 
the  number  of  processor  modules  and  the  precision  of  operands.  These  allow  a 
variable  length  operation  to  be  practicable  in  a  processor  with  a  variable 
number  of  digit  positions.  The  signed-digit  systems  are  natural  choices  for 
the  present  module  which  is  required  to  process  operands  with  a  variable 
precision  either  by  itself  or  in  parallel  with  a  variable  number  of  identical 
modules. 

2.1  Task  Summaries 

1. We studied the impact of signed-digit number systems for signal
processor implementations. Specifically, we proposed to implement new ALU
structures within the context of recursive algorithms (LMS, LS, SVD, Givens
Rotations, ...) focusing on fault-tolerant architectures.

2. We analyzed at least four architectures: fully-parallel multiple
adder/multiplier structures, distributed arithmetic structures, multiple operand
adder structures, and ROM/adder structures making maximal use of pipelining
and parallel mechanisms.


3. We studied engineering trade-offs between conventional 2's complement
arithmetic and signed-digit arithmetic to reduce pin count and develop more
functionally robust devices for signal processors. Multi-valued logic
circuits were considered.

2.2  Synopsis  of  Proposed  Method 

Classical techniques were applied to these tasks. Namely, VLSI
floorplans were produced and area/time figures of merit were generated.
Analytical comparisons were then established using popular benchmarks such as
the LMS algorithm and least-squares algorithms applied to signal processing for
radar, sonar, and communications tasks.

2.3  Objectives  Summary 

We  sought  to  demonstrate  the  effectiveness  of  each  implementation 
towards  design  goals  such  as  speed,  power,  weight,  and  size.  Additionally,  we 
intended  to  demonstrate  the  efficiencies  of  signed-digit  implementations  which 
supposedly  have  minimal  interconnects  between  adjacent  digit  positions.  We 
demonstrated  the  superior  modular  features  of  signed-digit  for  ALU's  in 
adaptive  signal  processors. 

3.0  Identification  and  Significance  of  Opportunity 

This  focused  architecture  study  exploited  promising  memory-oriented 
structures  common  to  distributed  arithmetic  organizations  because  the  costly 
multiply/accumulate  cycle  (typical  in  signal  processing)  reduces  to  fast 
shift/add cycles. Secondly, signed binary number representations (SBNR), a
subset of redundant number codes, were realized with higher information per
wire ratios, thus reducing intercell connections (a relatively high VLSI cost
in current conventional number systems). Thirdly, multi-valued logic
(although slow) maps SBNR representation one-to-one. Hence, its effectiveness
was studied.
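
The reduction of multiply/accumulate to shift/add in a memory-oriented (distributed arithmetic) organization can be sketched as follows. This is a minimal illustration under stated assumptions: fixed integer coefficients, two's complement inputs of a known width, and function names (`build_lut`, `da_inner_product`) that are hypothetical rather than taken from this report. The table plays the role of the ROM; each bit plane of the inputs costs one lookup and one shifted add.

```python
def build_lut(coeffs):
    """Precompute every partial sum of the coefficients (the ROM contents)."""
    k = len(coeffs)
    return [sum(c for j, c in enumerate(coeffs) if (addr >> j) & 1)
            for addr in range(1 << k)]


def da_inner_product(coeffs, xs, bits):
    """Inner product sum(c * x) for two's complement inputs of width `bits`,
    using one table lookup plus one shift/add per bit plane (no multiplier)."""
    lut = build_lut(coeffs)
    mask = (1 << bits) - 1
    words = [x & mask for x in xs]           # two's complement bit patterns
    acc = 0
    for b in range(bits):
        # Gather bit b of every input word into one ROM address.
        addr = 0
        for j, w in enumerate(words):
            addr |= ((w >> b) & 1) << j
        term = lut[addr] << b
        # The sign-bit plane carries negative weight in two's complement.
        acc += -term if b == bits - 1 else term
    return acc
```

The table grows as 2^k in the number of coefficients k, which is why practical designs usually split long inner products across several smaller tables.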

As a result, digital signal processing applications such as FFT's,
convolutions, Hartley transforms, beamforming, coding, communications
receivers, target trackers, and antenna arrays stand to achieve lower power
requirements and higher microminiaturization levels. There is a great need
for ultra-fast FFT's in spread spectrum. Because no architecture research
operates in a vacuum, we collaborated with NCR, TRW, and RCA foundries to
eventually test/develop actual devices. NCR is particularly interested in
this study because its local RAD facility designs systolic array devices
(notably the NCR 45CG72, the GAPP 6x12 PE chip, for which Space Tech has been
writing signal processing algorithms). This is one of the few available
true systolic array chips, and an excellent testbed for our studies.

[A difficulty in terminology now arises. In this research, we studied
redundant number systems (sometimes called redundant coding, SDNR, SBNR, and
mistakenly called negabinary and/or mirror numbers). We also investigated
fault-tolerant properties of this number system partially with redundant
circuits (here, "redundancy" refers to more than one circuit). We hope the
reader can determine the meaning from its context.]


3.1  Multi-Valued-Logic  and  Systolic  Arrays 


In recent papers by Hurst [2] and this Principal Investigator (see
Appendix), it is noted that multi-valued logic (MVL) may show great promise in
the future for VLSI. At present, binary systems are facing interconnect
problems which appear to be insurmountable. Silicon areas devoted to
intrachip connections now consume twice the area of active logic elements on
the chip [3]. Array implementations, whether data-flow, systolic, or otherwise,
cause a severe escalation of interconnect area, thus yielding lower silicon
area efficiencies. Likewise, off-chip connections are generating new and
complex problems for the board designer. Packaging solutions are not
without concomitant thermal and mechanical constraints. Such factors cause us
to reflect upon denser information content to interconnection ratios. A
solution using higher radix arithmetic is proposed which, coupled with MVL,
promises to relieve some of the silicon area inefficiencies incurred when
conventional binary arithmetic is used. Even for the regular architectures of
systolic arrays, Moraga [4] has shown the effectiveness of such MVL
implementations.

3.2  Computational  Model 

Our VLSI model of computation to derive complexity measures was based on
the following generally accepted assumptions [5-7]:

a. Wires have minimal width $W = \Theta(\text{const})$; hence W is the unit of
measure for the area.

b. The area required to store one bit of information is $\Theta(W^2)$; the
distance between parallel wires is $\Theta(W)$.

c. Double layer metalization is allowed.

d. Wires run only horizontally and vertically.

e. Each transistor needs a minimal transition time, $Y = \Theta(k)$ (k is a
constant), to change its state. Thus Y is the unit execution time.

f. A binary signal propagates along a wire in time $\Theta(W)$. Any long wire of
length L requires buffer/drivers with area $A = \Theta(W) \times O(L)$.

3.3  Signed-Digit  Number  Representation 

In  the  most  general  sense,  a  redundant  number  system  allows  both  an 
increase  in  the  number  of  positive  digits  and  negative  digits  as  follows. 

The representation described by Eqs. (1-3) is called redundant notation with
base d. The above mapping of the number representation (or notation)
assigns to a sequence of digits a value from a range Q, where Q
may be the integers, the reals, or zero.

The general redundant representation does not lead to efficiencies in
algorithm computations or implementations. If subsequent restrictions (Sec.
3.4) are placed upon the general redundant notation, very attractive
properties support efficient implementations. The basic properties
of Signed-Digit Number Representations (SDNR) are:

a. The radix, d, is a positive integer.

b. The SDNR of the algebraic quantity, zero, is unique if

$m = n, \quad (d-1) \ge m \quad \text{and} \quad m-1 \ge (d-1)/2$  (4)

c.  Transformations  between  conventional  representations  and  SDNR  exist. 

d.  Totally  parallel  addition/subtraction  are  possible. 

e. Addition and subtraction of two numbers are free of serial propagation of
carries/borrows.

f.  SDNR  numbers  are  positionally  weighted. 

g.  The  polarity  of  an  SDNR  number  is  given  by  the  polarity  of  its  most 
significant  non-zero  digit. 

h.  No  special  treatment  is  needed  for  the  most-significant  position. 

i.  Addition/subtraction  time  is  independent  of  operand  length. 

Avizienis [1], Atkins [8], Tung [9], Ercegovac [10], and Robertson [11]
have shown that SDNR can effectively operate in a general purpose digital
computer for the following reasons.

1. Redundancy introduced into the adder-subtracter structure reduces (but
does not entirely eliminate) carry-borrow propagation leading to rapid
multiplication.

2.  Full  precision  comparison  of  the  divisor  and  partial  remainder  in  division 
algorithms  is  not  required  because  quotient  digits  can  be  determined  from 
relatively  few  high  order  bits. 

3.  Negation  is  a  simple  logical  complementing  of  the  sign  bits  (e.g.,  unlike 
two's  complement  notation  which  requires  an  additional  step,  adding  an  LSB 
"one").  As  was  seen  in  the  ILLIAC  III  [8],  such  negation  expedites  execution 
of  floating-point  addition  and  subtraction. 

4. Variable length operand formats and parallel vector arithmetic are
facilitated by basic properties of SDNR's. First and foremost, operations can
proceed from left-to-right (rather than right-to-left as required in 1's, 2's
complement representations). Secondly, if appropriately implemented, the
position of the least-significant digit need not be known for adders and
subtracters.

5. Because a signed-digit combination adder/subtracter needs no carry/borrow
in the LSD, the ALU can be partitioned into identical and cascadable single
digit adder/subtracters. VLSI implementations tend to become highly regular.

6. Multiplication with SDNR tends to automatically produce rounded results
(of great importance in computationally intensive signal processing
applications). In fact, Robertson, based on work by Rohatsch [12], has shown
that the probability of obtaining a rounded result is 5/6.

7. SDNR allows unusual algorithms such as wired-in significant-digit
arithmetic [13] and dual notation algorithms capable of accepting both SDNR
and conventional operands (1's, 2's complement) to produce SDNR results [14].
These observations lead to the implementation of a universal Arithmetic
Building Element (ABE) capturing not only the preceding algorithms but also
efficiently separating functions of logic design and arithmetic design [15].

8. Overflow detection can occur immediately following the production of
most-significant result digits (unlike conventional notation).
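
Several of the properties listed above (positional weighting, polarity from the most significant non-zero digit, and carry-free negation) can be checked with a short sketch. It assumes radix 2 with the digit set {-1, 0, 1} and stores digits most-significant first; the function names are illustrative only.

```python
def sd_value(digits, radix=2):
    """Value of a signed-digit integer; digits are listed most-significant first."""
    value = 0
    for d in digits:
        value = value * radix + d
    return value


def sd_negate(digits):
    """Negation flips the sign of every digit; no carry or '+1' step is needed."""
    return [-d for d in digits]


def sd_sign(digits):
    """Polarity equals the polarity of the most significant non-zero digit."""
    for d in digits:
        if d != 0:
            return 1 if d > 0 else -1
    return 0
```

For example, `[1, -1, 0, -1]` and `[0, 1, -1, 1]` both evaluate to 3, which shows the redundancy of the representation, and `sd_negate` maps either of them to a representation of -3.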

3.4 Efficient SDNR Realization

Several implementations based on the SDNR have already been investigated
[11,16,17,18]. All of these, however, sought to satisfy general data
processing requirements of a mainframe computer. In contrast, signal
processing applications are generally multiplication/addition intensive. (Of
late, the utility of distributed arithmetic [19,20,21] has shed new light on
bit-wise algorithms, also essentially partial product and accumulation
intensive.)

An allowed digit set (-1,0,1), which is a subset of the SDNR, is assumed.
A redundant Signed Binary Number Representation (SBNR):

$(X_n X_{n-1} \cdots X_1)_{SD}$  (5)

represents a number whose value is expressed as

$\sum_{i=1}^{n} X_i \cdot 2^{i-1}$  (6)

The  importance  of  SBNR  is  as  follows: 

a. Conversion of unsigned binary numbers to SBNR is unnecessary as they are
identical.

b. Since a two's complement binary representation $(X_n X_{n-1} \cdots X_1)_2$
expresses the number

$-X_n 2^{n-1} + \sum_{i=1}^{n-1} X_i \cdot 2^{i-1}$  (7)

this same number can be expressed in SBNR by

$(\bar{X}_n X_{n-1} \cdots X_1)_{SD}$  (8)

because the sign bit $X_n$ in 2's complement representation is considered to have
weight $-2^{n-1}$ (here $\bar{X}_n$ denotes $-X_n$). Hence, conversion from signed
binary 2's complement representation to SBNR is simply an inversion of the sign
bit alone!
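
The conversion rule in item b can be exercised with a short sketch. Digits are stored most-significant first, and the function names are hypothetical. Every bit is copied unchanged except the sign bit, whose digit becomes the negative of the bit because its positional weight is negative.

```python
def twos_complement_to_sbnr(x, n):
    """Return the n SBNR digits (most-significant first) of an n-bit two's
    complement integer x. Only the sign position changes: its digit is -X_n."""
    if not -(1 << (n - 1)) <= x < (1 << (n - 1)):
        raise ValueError("x does not fit in n bits")
    pattern = x & ((1 << n) - 1)
    bits = [(pattern >> i) & 1 for i in range(n - 1, -1, -1)]
    return [-bits[0]] + bits[1:]


def sbnr_value(digits):
    """Evaluate sum(X_i * 2**(i-1)) for digits listed most-significant first."""
    value = 0
    for d in digits:
        value = 2 * value + d
    return value
```

For instance, the 4-bit pattern 1101 (that is, -3) maps to the digits -1, 1, 0, 1, whose weighted sum is -8 + 4 + 1 = -3.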

Avizienis [1] further demonstrated that the SBNR (radix d=2 with digit
values -1,0,1) with a decreased redundancy requirement (invoking a two-step
addition by allowing the propagation of the transfer digit over two digital
positions to the left) requires only d+1 sum digit values. In general, he showed
that the lower limit of required redundancy of one digit depends on the number
of digital positions the transfer digits propagate, as follows.


If GIVEN:

a. no redundancy utilized

b. $s_i$, sum digit $\in$ {d values only}

THEN:

$s_i = f(z_i, y_i, z_{i+1}, y_{i+1}, \ldots)$  (9)

$z_i$ = ith addend digit

$y_i$ = ith augend digit

If, however, $s_i \in$ {d+1 values}, then Eq. (9) becomes

$s_i = f(z_i, y_i, z_{i+1}, y_{i+1}, z_{i+2}, y_{i+2})$  (10)

and if $s_i \in$ {d+2 values or more}, then Eq. (9) becomes

$s_i = f(z_i, y_i, z_{i+1}, y_{i+1})$  (11)

Using these observations, a single cell can implement the one digit
adder/subtracter if certain choices for a redundant digit are always made.
Specifically, let any redundant binary digit be represented by two bits $X_s$ and
$X_d$ as follows, where $\bar{1} = -1$.

Table 1. Redundant Digit Selection Rule

[Table body not recoverable from the source. Column headings: Redundant Digit X;
Representation (Sign, Digit).]

Invoking this TRIT realization for our SBNR further simplifies the cell
implementation without sacrificing the transfer digit propagation advantage.
Using this subset allows six types of intermediate results in the first of two
addition steps as defined in Table 2.

Table 2. Intermediate Addition Step

        Augend, Addend     Next Lower Position         Carry   Intermed. Sum
Type    (x_i, y_i)         (x_{i-1}, y_{i-1})          (c_i)   (s_i)
----    ----------------   -------------------------   -----   -------------
 1      (1, 1)             (any)                         1       0
 2      (1, 0) or (0, 1)   Both are non-negative         1      -1
        (1, 0) or (0, 1)   At least one is negative      0       1
 3      (0, 0)             (any)                         0       0
 4      (1, -1)            (any)                         0       0
        (-1, 1)            (any)                         0       0
 5      (0, -1) or (-1, 0) Both are non-negative         0      -1
        (0, -1) or (-1, 0) At least one is negative     -1       1
 6      (-1, -1)           (any)                        -1       0

The second step in an addition cycle adds $s_i$ and $c_{i-1}$ from the next lower
position to obtain a sum digit with no carry/borrow generation required.

If we allow any redundant binary digit to be represented as $X_s X_d$ with
the redundant digit selection rule as prescribed in Table 1, the Boolean
equations which govern selection and addition per Table 2 produce two critical
observations. The ith SBNR carries, $C_s$ and $C_d$, depend only upon the ith and
(i-1)th digits and the (i-1)th carries. Hence, carry propagation extends only
into the next adjacent digit column. SBNR addition does not require full-word
carry propagation as in binary addition. SBNR addition makes systolic array
implementations straightforward. Pre-scrambling bits or words is not
required.
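
The two-step addition described above can be sketched in software. The selection rule used here is the commonly published one for radix-2 signed-digit addition (choose the transfer and intermediate sum by looking at the signs of the next lower operand digits); it is an assumption that it coincides with the rule in Table 2, whose extracted form is damaged. Digits are stored least-significant first, and the function name is illustrative.

```python
def sbnr_add(x, y):
    """Add two SBNR numbers given as equal-length, non-empty digit lists
    (least-significant first, digits in {-1, 0, 1}).
    Returns len(x) + 1 digits, least-significant first."""
    n = len(x)
    if n == 0 or n != len(y):
        raise ValueError("operands must be non-empty and of equal length")
    s = [0] * n          # intermediate sum digits
    c = [0] * n          # transfer (carry) digits; c[i] has weight 2**(i+1)
    for i in range(n):   # step 1: every position is computed independently
        p = x[i] + y[i]
        # Inspect only the next lower position (zeros assumed below digit 0).
        lower_nonneg = i == 0 or (x[i - 1] >= 0 and y[i - 1] >= 0)
        if p == 2:
            c[i], s[i] = 1, 0
        elif p == 1:
            c[i], s[i] = (1, -1) if lower_nonneg else (0, 1)
        elif p == 0:
            c[i], s[i] = 0, 0
        elif p == -1:
            c[i], s[i] = (0, -1) if lower_nonneg else (-1, 1)
        else:            # p == -2
            c[i], s[i] = -1, 0
    # Step 2: s[i] + c[i-1] always lies in {-1, 0, 1}, so nothing propagates.
    z = [s[i] + (c[i - 1] if i > 0 else 0) for i in range(n)]
    z.append(c[n - 1])
    return z
```

Because each output digit depends only on positions i and i-1, the loop body could run for all positions at once; the sequential `for` is only a software convenience.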

A  primitive  cell  suitable  for  large  VLSI  arrays  and  especially  for 
adaptive  signal  processors  must  have  few  interconnections  beyond  its  nearest 
neighbors  and  must  have  very  simple  controls.  VLSI  arrays  effectively 
function  in  a  data-flow  manner.  Fortunately,  many  signal  processing 
algorithms  can  be  implemented  with  distributed  or  bit-serial  arithmetic. 
Mactaggart  and  Jack  [22],  and  others  have  shown  that  bit-serial 
implementations  offer  a  highly  regular  design  and  lower  power  consumption  than 
conventional arithmetic. One such cell [16] is depicted in Figure 1. This
cell  implements  the  basic  addition/subtraction  steps  of  Table  2  using  the  SBNR 
of  (-1,0,1)  and  the  redundant  digit  selection  rule  of  Table  1. 


Figure  1.  Primitive  Bit-Serial  Cell  [16] 


3.5  Matrix  X  Matrix  Multiplication 

Most of the computational effort expended by a digital signal processor
is devoted to matrix x matrix multiplication. Such matrix operations may be
either sums of word level products or sums of bit level products. We now know
that a strong relationship exists between word and bit level systolic arrays
[23]. If we treat such computational problems from the outset as bit level
manipulations, fast area efficient VLSI arrays are possible [24,25]. In our
implementations, a systolic-like bit level approach is assumed where each
processing cell is a multiplier and gated full adder. However, the multiplier
and adder utilize SBNR rather than 2's complement arithmetic for reasons
discussed earlier.

Another advantage of SBNR is the absence of special circuitry and
algorithms to handle signed operands. In 2's complement arithmetic, the
Baugh-Wooley algorithm can be used (with an attendant high latency cost). In
this procedure, 2's complement words are treated as positive numbers if:

1. A fixed correction term is added to the result for each word level
multiplication.

2.  All  partial  products  normally  with  a  negative  weighting  are  complemented. 

Two's complement implementations on a systolic array require a negative
weighting flag or a tag on the partial products which must propagate
vertically down through the array. Hence, another latch and control line are
required for each columnar path. Furthermore, final addition of correction
terms requires an initialization of the accumulators in the adder trees to a
value, which is generally

$(2^m - 2^{m-1}) \times n$  (12)
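
The two conditions of the Baugh-Wooley procedure can be illustrated for a single word level product. This sketch uses the textbook form of the algorithm for two n-bit operands, in which the fixed correction constant is 2^n + 2^(2n-1) modulo 2^(2n); that constant is an assumption of the sketch and is not claimed to match the accumulator preset of Eq. (12), which covers a sum of products. The function name is illustrative.

```python
def baugh_wooley_multiply(a, b, n):
    """Multiply two n-bit two's complement integers using only non-negative
    partial-product bits plus one fixed correction constant."""
    mask = (1 << n) - 1
    abits = [((a & mask) >> i) & 1 for i in range(n)]
    bbits = [((b & mask) >> i) & 1 for i in range(n)]
    total = 0
    for i in range(n):
        for j in range(n):
            bit = abits[i] & bbits[j]
            # Exactly one sign bit involved: the term has negative weight,
            # so it is complemented instead of subtracted.
            if (i == n - 1) != (j == n - 1):
                bit ^= 1
            total += bit << (i + j)
    total += (1 << n) + (1 << (2 * n - 1))   # fixed correction term
    total &= (1 << (2 * n)) - 1              # keep the 2n result bits
    # Reinterpret the 2n-bit pattern as a signed integer.
    return total - (1 << (2 * n)) if total >> (2 * n - 1) else total
```

Every partial product here needs to know whether it sits in a sign row or column, which corresponds to the flag or tag that the text says must travel through a two's complement array.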

In general, the number of systolic array cells required to multiply two
n x m matrices with elements m bits long in a fully parallel fashion is a total
of

$n \times (2m + \log_2 n)$  (13)

cells where word growth is taken into account. However, McCanny and McWhirter
[26] have identified a procedure to halve the number of cells by removing
intermediate zero bits. The procedure is to permit only one set of words
within a given row to move at any time slot, keeping the other set of words
within the row at fixed sites. Then, in the next time slot, move the fixed
set of words and keep the previously moved set fixed. This alternation of
left/right moves can be maintained by latching bits using half the system
clock speed. Successive rows in the array must move in anti-phase relative
to each other.

3.6  Device  Redundancy  and  Fault-Tolerance 

Any practical architecture designed today should be highly
fault-tolerant. Circuit redundancy and built-in-self-test are theoretically
achievable. Redundancy (of elements, not arithmetic codes) does offer one
additional advantage to the chip builder. The system designer can run models
long before production of the new system starts. But reliably high-yield
logic chips for these machines are often difficult to achieve because the
system designer always wants the very latest in technology. Redundancy in the
basic logic design can enhance the yield by a significant amount and greatly
reduce the wafer start requirements. When the yield increases and production
starts, this same redundancy is then available to improve reliability.

The  model  in  Section  5.7  demonstrates  the  dependence  of  yield  on  the 
nature  of  the  defects  and,  together  with  gross  yield  estimates  and  the 
appropriate  nonredundant  yield  factor,  it  will  serve  as  a  reasonable  starting 
point  to  model  actual  yield  data.  The  existence  of  complex  local  correlations 
and some non-point-like defects will clearly complicate matters, although, in
many  cases,  a  perturbative  approach  will  be  adequate  to  model  the  situation. 
Understanding  yield  issues  is  important  to  architecture  design. 

3.7  Redundancy,  Fault-Tolerance  and  Testing 

Achieving high reliability in a complex device or system is a difficult
but critical task. The investigations for this project have included
careful consideration of reliability and testability. It is
now challenging for manufacturers to maintain a compound growth rate in
per-circuit reliability of 60%. Past methods are no longer valid. The
verification of machine reliability due to electronic components poses a
significant challenge for the future. Consider two realistic
examples. Assume a computer with 1,000 circuits/chip. Suppose that a
manufacturer builds 1,000 machines to achieve 50K user power-on hours per
machine at the usual 1,000 hour MTTF for the electronics. This corresponds to
a 50 PPM cumulative fraction fail for the chips. To verify this failure rate
at a 90 percent confidence level, a manufacturer would have to test 80,000
chips to allow for one failure during testing. This represents eight percent
of a production run. Historical trends indicate
that the overall reliability rate improvement is approximately 2X each year.
By 1994, if we consider the constraints of the first example, the reliability
must theoretically be improved by 1,000 times. Verifying reliability to the
same confidence level would require 80M chips, or 80 times the number in the
actual production run. Clearly, reliability, testing, and redundancy are
intimately coupled.
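
The sample sizes quoted above can be reproduced approximately with a standard acceptance-sampling calculation. The report does not state which method it used; the sketch below assumes a Poisson model in which the test passes with at most one observed failure, and the function names are hypothetical. Under that assumption the result is close to 78,000 chips for 50 PPM at 90 percent confidence, consistent with the rounded figure of 80,000, and it scales to roughly 80M when the target fraction is tightened 1,000 times.

```python
from math import exp


def poisson_cdf_le1(lam):
    """P(at most one failure) when the expected number of failures is lam."""
    return exp(-lam) * (1.0 + lam)


def chips_to_test(fraction_fail, confidence, lo=0.0, hi=50.0):
    """Sample size that demonstrates `fraction_fail` at the given confidence
    while allowing one failure (Poisson model, solved by bisection)."""
    target = 1.0 - confidence
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if poisson_cdf_le1(mid) > target:
            lo = mid        # too few expected failures: a larger sample is needed
        else:
            hi = mid
    return hi / fraction_fail
```

Usage: `chips_to_test(50e-6, 0.90)` gives the first example, and `chips_to_test(50e-9, 0.90)` gives the second.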

The  following  aspects  of  reliability  were  considered: 

1.  A  major  problem  with  all  complex,  newly  designed  devices  which  use  a 
sufficiently  large  chip  area  is  poor  yield.  The  yield  can  be  significantly 
improved  by  using  redundancy.  During  the  final  manufacturing  stages,  devices 
with  only  a  few  "bad"  cells  can  be  reconfigured  to  leave  out  bad  cells. 

2. A similar approach for fault-tolerance can be used for hard failures
in the field. In this case, reconfiguration has to be dynamic. This means
that after a cell has been detected to be faulty, the array configuration has
to be altered under program control.

3.  To  study  the  effectiveness  of  fault-tolerance  and  for  optimizing 
such  designs,  estimation  of  hard  and  soft  failure  rates  is  required.  Because 
the  handling  strategies  can  be  different,  hard  and  soft  failures  often  have  to 
be  considered  separately.  Preliminary  estimates  are  based  on  empirical 
techniques.  Such  estimates  are  not  very  accurate,  but  are  still  indispensable 
when  evaluating  different  design  options. 

Consideration  of  soft  failures  is  especially  important  for  Multi-Valued 
Logic  (MVL)  devices.  Because  the  voltage  range  is  divided  into  more  than  two 
regions,  it  will  take  much  less  energy  (from  electromagnetic  noise  or  alpha- 
particles,  etc.)  to  cause  an  extraneous  transition. 

4. Testing, both by the manufacturer and in the field, is an integral
part of reliability strategy. It is now recognized that testing must be
considered during the design phase itself. Two aspects of testing will be
considered. One is Design-For-Testability (DFT), which is to be used for easier
and faster test-pattern generation and application. The other is
Built-In-Self-Test (BIST), which allows a system to exercise itself and verify
correctly operating hardware.

4.0  Technical  Objectives 

Succinctly,  the  technical  objectives  of  this  effort  were: 

a. Determine intrinsic properties of SBNR embedded as PE's in a systolic
array via distributed arithmetic cells. Capitalize on the inherent modular
properties of residue numbers to be implemented in SBNR engines.

b.  Establish  highly  modularized  architectures  using  SBNR  arithmetic  engines 
to  increase  information  per  wire  ratios. 


c.  Determine  engineering  trade-offs  of  power,  weight,  and  size  for  SBNR  array 
architectures  to  help  a  system  designer  and  silicon  floorplanner  lay  out 
competitive  VLSI  devices. 

4.1  Higher  Radix  Implementations 

We  considered  at  least  one  implementation  of  higher  radix  arithmetic, 
namely  ternary,  which  when  viewed  as  a  redundant  or  signed-digit  number  system 
held  promise  for  signal  processing  applications  in  which  division-sparse 
operations  occur.  We  studied  signed-digit  number  representations  and  basic 
properties  attractive  to  signal  processing  applications  which  manipulate 
sequential  data  streams. 

4.2  Systolic  Array  PE 

We  identified  a  realization  of  TRITS  (ternary  digits)  which  serves  as 
the  primitive  VLSI  cell.  The  regular  nature  of  this  cell  enhances  systolic 
array  architectures.  Multiple-valued  encoding  affords  us  the  opportunity  to 
reduce  ripple-through  carries.  Ternary  arithmetic  may  have  a  balanced  as  well 
as an unbalanced coding. Balanced encoding requires fewer gates when compared
to binary and unbalanced encodings. Unfortunately, logic delay increases
[27]. However, in the TRIT realizations utilized herein, a balanced encoding
coupled with redundancy in the encoding improves both logic delay and gate
count.
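
To make the balanced versus unbalanced distinction concrete, the following
sketch converts an integer to both ternary codings. It is illustrative only:
the function names are assumptions introduced here and do not describe the
TRIT cell of this report.

```python
def to_unbalanced_ternary(n):
    """Digits in {0, 1, 2}, least-significant first; n must be >= 0."""
    if n < 0:
        raise ValueError("unbalanced ternary needs a separate sign")
    digits = []
    while True:
        n, r = divmod(n, 3)
        digits.append(r)
        if n == 0:
            return digits


def to_balanced_ternary(n):
    """Trits in {-1, 0, 1}, least-significant first; any integer n."""
    digits = []
    while True:
        r = n % 3              # Python's % returns 0, 1, or 2 here
        if r == 2:             # write 2 as 3 - 1: emit -1 and carry 1
            r = -1
        digits.append(r)
        n = (n - r) // 3
        if n == 0:
            return digits


def from_digits(digits, radix=3):
    """Evaluate a least-significant-first digit list."""
    return sum(d * radix**i for i, d in enumerate(digits))
```

In the balanced coding a negative number needs no separate sign, and negation
is a digit-by-digit sign flip.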


This  Principal  Investigator  has  considerable  design  experience  with 
systolic  array  PE's.  He  has  designed  control  units  for  the  first  systolic 
array  (NCR  45CG72)  and  generated  several  signal  processing  algorithms  for  it 
in  conjunction  with  NCR  (including  LMS,  LS  and  SVD  for  adaptive  beamformer 
applications).  From  the  experiences,  a  basis  for  new  and  faster  circuits  can 
be  identified.  One  such  candidate,  SBNR  PE  suitable  for  a  systolic  array,  is 
shown in Figure 2. This is a derivative of the NCR cell with several critical
differences. First, additional latches and data paths exist. Second, RAM is
much larger at each cell. Third, internal cell pipelining is used to speed
effective instruction execution (not easily shown in a block diagram).
Fourth, the cell implements signed-digit arithmetic with fewer intracell
connections. Lastly, this single cell can do multiply, add, and subtract in
fewer steps. A systolic array module (SAM) of this PE is depicted in the
floorplan of Figure 3.

Figure 2. An SBNR Data Flow Cell

4.3 Signal Processing Algorithm

There is an intimate coupling between word and bit level matrix x matrix
multiplication. A systolic implementation of common algorithms invoking a
digit subset of redundant number representations, the Signed Binary Number
Representations (SBNR), is easily realized with TRIT MVL. A significant
property of redundant number systems is that they support left-to-right
(most-significant-digit to least-significant-digit) algorithms. Sips [28] has
demonstrated the utility of left-to-right algorithms for a general purpose
computer. We found this left-to-right property extremely beneficial for the
ADC and DAC interface.

We  analyzed  appropriate  ADC  and  DAC  SBNR  realizations.  It  is  important 
to  note  that  the  realizations  directly  carry  over  from  a  property  of  redundant 
numbers. This is vital for real-time signal processing (which is
predominantly  analog  sourced). 
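
The left-to-right property can be illustrated with a generic
most-significant-digit-first conversion of a sample into signed binary digits.
This is a textbook-style sketch under stated assumptions (input scaled to
[-1, 1], comparison thresholds at +/-1/2); it is not the ADC realization
analyzed in this report, and the function names are hypothetical.

```python
def msd_first_signed_binary(x, n_digits):
    """Return n_digits digits in {-1, 0, 1}, most significant first.

    Requires -1 <= x <= 1. Digit i (1-based) has weight 2**-i.
    """
    if not -1.0 <= x <= 1.0:
        raise ValueError("x must lie in [-1, 1]")
    digits = []
    r = x                       # running residual, stays within [-1, 1]
    for _ in range(n_digits):
        r *= 2.0
        if r >= 0.5:            # only a coarse comparison is needed
            d = 1
        elif r <= -0.5:
            d = -1
        else:
            d = 0
        digits.append(d)
        r -= d
    return digits


def value_of(digits):
    """Evaluate a most-significant-first signed binary fraction."""
    return sum(d * 2.0 ** -(i + 1) for i, d in enumerate(digits))
```

Because the representation is redundant, each digit can be committed from a
coarse comparison of the residual, and later digits can compensate for an
earlier choice. After n digits the value is within 2**-n of the input.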

4.4  Fault-Tolerant  Properties 

It is important to identify PE's that are highly regular, have minimal
I/O pinout requirements, have minimal gate count and inter- and intra-circuit
connectivity, and have low power requirements, all of which support a high
degree of fault-tolerance. VLSI technologists are fast developing wafer-scale
integration. A major problem with such assemblies is that some cells are
likely to be defective. Hence, a major objective was to determine optimal
reconfigurable networks "around" such faults for our SBNR PE systolic arrays.
The procedure was to minimize the length of the longest wire in the system,
thus minimizing the communication time between cells. Channel width was also
a major consideration. The procedure assumed a probabilistic model of cell
failure, for which Leighton and Leiserson [29] have demonstrated many positive
results. In many ways this problem is similar to the graph-theoretic models
used in the bottleneck traveling salesman problem. Leiserson has already
derived bounds on wire length and channel width for two-dimensional arrays.
We compared our results with these bounds. The results in [29] show that
there is a simple, linear-time algorithm to connect most of the live cells on
an N-cell wafer into a linear array using wires of unit length 1, 2, or 3 and
channels of unit width 2.
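
As a point of comparison for the longest-wire objective, the sketch below
links the live cells of a wafer in a simple snake (boustrophedon) order and
reports the longest wire that results. This is a naive baseline written for
illustration, not the algorithm of [29]; the grid model, the Manhattan wire
length, and the function name are all assumptions.

```python
def snake_link(live):
    """live: 2-D list of booleans (True = working cell).

    Returns (chain, longest_wire). chain lists the (row, col) of the live
    cells in snake order; longest_wire is the largest Manhattan distance
    between consecutive cells of the chain.
    """
    chain = []
    for r, row in enumerate(live):
        cols = list(range(len(row)))
        if r % 2 == 1:                 # odd rows are scanned right to left
            cols.reverse()
        chain.extend((r, c) for c in cols if row[c])
    longest = max(
        (abs(r1 - r0) + abs(c1 - c0)
         for (r0, c0), (r1, c1) in zip(chain, chain[1:])),
        default=0,
    )
    return chain, longest
```

On a fault-free wafer every wire has unit length; runs of dead cells stretch
the longest wire, which is the quantity the reconfiguration procedure tries to
keep small.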

5.0  Research  Methodology 

5.1  Optimized  Fault-Tolerant  Designs 

A  four-step  procedure  is  used  from  the  top-level  down.  At  the  first  and 
highest  step,  use  of  an  SBNR  allows  parallel  and  modularized  operation  of  MVL 
arithmetic  processors  for  fast  execution  of  full  precision,  fixed-point 
arithmetic. 

Second, a memory-intensive arithmetic algorithm is employed which
capitalizes on the short internal word lengths of SBNR processors. ROM-based
structures have been shown by Peled and Liu [30] to be extremely effective for
FIR filters. This PI has made the same discovery for adaptive filters using
combinations of ROM's and RAM's. Third, memory accesses within processors can
be pipelined. Fourth, transistor-level simulation tools can be employed to
design the high-speed memory circuits. The capability of identifying failed
processor bits and maintaining correct DSP output in the presence of errors
arises from our use of an extensible algorithm for incorporating redundant
processor chips.

5.1.1  Reliability  Models 

Reliability  models  to  evaluate  computer  systems  must  estimate  multiple 
system  parameters  (e.g.,  failure  rate).  The  quality  of  the  prediction  model 
rests  on  the  estimates  of  its  input  parameters.  In  practice,  extensive 
testing  and  burn-in  procedures  produce  a  best  point  estimate  with  measures  of 
dispersion. In a crude way, we employ only a point estimate; e.g., the mean
or an upper or lower bound [31]. Depending on the dispersion in its input
parameters, a reliability estimate may or may not be acceptable. For many
types of computer components, where reliability is high and failures are rare,
the uncertainty involved in determining a parameter such as the failure rate
may be large. An ultrareliable system necessitates
investigating the ensuing uncertainty in system reliability. Unfortunately,
this  problem  has  not  been  well  studied.  Few  available  reliability  evaluation 
programs  offer  such  sensitivity  estimates. 

Our model was as follows (see [32]). Assume that system lifetimes are
exponentially distributed. To consider the dispersion in parameter
estimation, stipulate that the failure rate L is a random variable, so that
the model is doubly stochastic [33]. The system reliability at a given time,
t, also referred to here as the point reliability, is then a random variable
R_L(t) [with a particular value r_L(t)], with the distribution induced by L.
The variance of R_L(t) is a crude but effective measure of the dispersion due
to the random nature of L. We can now exploit the useful properties of
exponential distributions in a model with variations in failure rate. We use
two approaches: an exact model based on the complete distribution of L, and
an approximation that employs only the first and second moments of the
parameter distribution.

Iyer  [32]  has  shown  feasible  exact  and  approximate  models.  The  exact 
model  is  based  on  a  gamma  distribution  and  is  easily  extended  to  fault- 
tolerant  redundancy  configurations,  such  as  TMR,  by  substituting  the 
appropriate  value  for  system  reliability.  Iyer  develops  first  and  second 
moments  for  time  point  reliabilities. 
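
A minimal numerical sketch of the two approaches follows, assuming R_L(t) =
exp(-L t) and a gamma density for L parameterized by shape and rate. The
closed forms come from the gamma moment generating function. The second-order
Taylor approximation, the ideal-voter TMR formula 3R^2 - 2R^3 with one shared
L, and all function names are assumptions made here for illustration; they
are not taken from [32].

```python
import math


def exact_point_reliability(t, shape, rate):
    """Mean and variance of exp(-L t) for L ~ Gamma(shape, rate)."""
    mean = (rate / (rate + t)) ** shape
    second = (rate / (rate + 2.0 * t)) ** shape   # E[exp(-2 L t)]
    return mean, second - mean ** 2


def approx_point_reliability(t, mu, var):
    """Second-order Taylor estimate using only the mean and variance of L."""
    base = math.exp(-mu * t)
    mean = base * (1.0 + 0.5 * var * t ** 2)
    variance = (t * base) ** 2 * var
    return mean, variance


def exact_tmr_mean(t, shape, rate):
    """E[3 R^2 - 2 R^3] for an ideal-voter TMR sharing one random L."""
    e2 = (rate / (rate + 2.0 * t)) ** shape
    e3 = (rate / (rate + 3.0 * t)) ** shape
    return 3.0 * e2 - 2.0 * e3
```

For a gamma density the mean of L is shape/rate and its variance is
shape/rate**2, so the two routines can be compared directly; they agree
closely when the dispersion in L is small.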

5.2 Hypergraph Models for Fault-Tolerant Systolic Arrays

We proposed and used a graph theoretic procedure similar to [34] to
measure the VLSI effectiveness of our design strategy by the area required to
lay out the fault-tolerant processor arrays. For completeness, we repeat the
procedure of [34] here. Three design strategies are described briefly.

1. Embed the desired array in a simple graph to model techniques that build a
fault-tolerant array. Each PE must contain a robust switching mechanism to
configure the good PE's into an array of the desired structure using
nearest-neighbor connections.

2. Embed the desired array in a graph with multipoint edges to build a robust
array by running buses adjacent to the PE's and interconnecting the fault-free
PE's into the bank of buses (say, via laser welding). Realize each array link
via a distinct bus.

3. Embed the desired array into a "switched" graph, whose vertices are
partitioned  into  PE  vertices  and  switch  vertices.  Try  to  realize  inter-PE 
connections  through  a  switching  network  external  to  the  PE's  thereby  allowing 
one  to  bypass  faulty  PE's.  In  comparison  studies  by  [35,36],  no  single  design 
strategy  as  yet  appears  to  be  uniformly  superior  to  any  other. 

Because the problem is multi-faceted, a firm understanding of all three
strategies is vital. Methods 1 and 3 have been deeply studied [36-43].
Method 2 was proposed in [37] and tangentially studied in [36].

5.2.1  The  Design  Strategy 

We followed the procedure of [34]. Assume a target array
structure. Construct a fault-tolerant array to simulate this structure. PE's
are  represented  by  squares,  and  both  wires  and  buses  are  represented  by  lines 
as  in  Figure  4  [34].  Now  construct  some  number  of  identical  PE's  that  are 
precisely  the  PE's  that  one  would  design  for  the  ideal  array,  with  the  same 
I/O  interfaces.  Next,  lay  the  PE's  out  in  a  (logical,  if  not  physical)  row, 
with  lines  coming  out  of  their  I/O  ports  running  perpendicular  to  the  row  of 
PE's.  Then  run  some  number  of  buses  above  the  row  of  PE's.  We  are  told  (via 
some  unspecified  mechanism)  which  of  the  PE's  are  faulty  and  which  are  fault- 
free.  Now  use  laser  welding  to  connect  I/O  lines  to  buses  in  a  way  that 
configures  the  fault-free  PE's  into  an  array  of  the  desired  structure. 

Use the following area definition of [34]:

area(array) = (PE-number) x (PE-width) x (Bus-depth)

Let  Bus-depth  be  the  maximum  number  of  buses  passing  over  any  point  in  the 
layout.  (Ignore  the  contribution  of  the  separate  PE's.) 

A  solution  array  has  two  components:  specification  of  the  structure  of 
the  array  and  of  the  configuration  procedure.  The  procedure  is  an  assignment 
operation  mapping  ideal-array  PE's  onto  actual  PE's,  as  well  as  a  mapping  of 
ideal-array  edges/communication  links  to  the  buses  that  will  simulate  them. 
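
The configuration procedure and the area measure can be sketched for the
simplest target, a linear array. The code below assigns ideal PE k to the k-th
fault-free physical PE, realizes each ideal link as one bus span, and
evaluates the area product defined above. Modeling "any point in the layout"
as the gaps between adjacent physical PE's is an assumption of this sketch, as
are the function names.

```python
def configure_linear_array(fault_free):
    """Map ideal PE k onto the k-th fault-free physical PE.

    fault_free: list of booleans in physical row order.
    Returns (assignment, buses). assignment[k] is the physical index of
    ideal PE k; each bus is a (left, right) span joining neighbouring
    ideal PE's.
    """
    assignment = [i for i, ok in enumerate(fault_free) if ok]
    buses = list(zip(assignment, assignment[1:]))
    return assignment, buses


def bus_depth(buses):
    """Maximum number of bus spans crossing any gap between adjacent PE's."""
    if not buses:
        return 0
    last = max(right for _, right in buses)
    return max(
        sum(1 for left, right in buses if left <= gap < right)
        for gap in range(last)
    )


def array_area(pe_number, pe_width, buses):
    """area(array) = (PE-number) x (PE-width) x (Bus-depth)."""
    return pe_number * pe_width * bus_depth(buses)
```

For a linear target the spans never overlap, so the bus depth is 1 regardless
of the fault pattern; richer targets such as trees or meshes produce
overlapping spans and a larger depth.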

5.3  Comparison  of  Error  Detecting  Codes 

Several  techniques  to  obtain  fault  tolerance  through  error  detection 
have  been  studied.  Most  of  these  schemes  can  be  categorized  as  being  hardware 
redundant  or  time  redundant.  The  hardware  redundant  systems  (for  example, 
Triple  Modular  Redundancy  [44]  and  quadded  logic  [45])  typically  require 
arithmetic to be computed in more than one processor. A checker compares the
results  to  detect  errors.  These  schemes  require  a  factor  of  at  least  2  or  3 
in  hardware  redundancy. 
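
The checker in a triple modular redundancy scheme reduces to a 2-out-of-3
majority vote. A minimal sketch follows, assuming the three processor results
are available as integers; the function name is hypothetical.

```python
def tmr_vote(results):
    """Bitwise 2-out-of-3 majority of three integer results.

    Returns (voted_value, disagreement), where disagreement is True when
    the three copies are not all identical.
    """
    a, b, c = results
    voted = (a & b) | (a & c) | (b & c)
    disagreement = not (a == b == c)
    return voted, disagreement
```

A single faulty copy is outvoted by the other two, which is why the scheme
costs a factor of three in arithmetic hardware.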

The  time  redundant  scheme  requires  that  each  result  be  calculated  twice, 
with  the  two  answers  compared  to  find  errors.  Two  examples  of  this  approach 
are alternating logic [46] and recomputing with shifted operands [47]. In the
alternating logic technique, the result is recomputed from inverted operands
and should be the inverse of the original result. Recomputing with shifted
operands verifies that, when the operands are shifted, the result contains a
shifted version of the original bit pattern. Both of these systems are
effective primarily for stuck-at faults.
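
Recomputing with shifted operands can be sketched in software with a
deliberately faulty adder. The fault model (one output bit stuck at 0), the
16-bit width, and the function names are assumptions chosen for illustration.

```python
WIDTH = 16
MASK = (1 << WIDTH) - 1


def make_adder(stuck_at_zero_bit=None):
    """Return a WIDTH-bit adder; optionally one output bit is stuck at 0."""
    def add(a, b):
        s = (a + b) & MASK
        if stuck_at_zero_bit is not None:
            s &= ~(1 << stuck_at_zero_bit)
        return s
    return add


def reso_add(a, b, add):
    """Add twice, plain and left-shifted by one bit, then compare.

    Assumes a + b fits in WIDTH - 1 bits so the shifted pass cannot
    overflow. Returns (result, error_detected).
    """
    first = add(a, b)
    second = add((a << 1) & MASK, (b << 1) & MASK)
    return first, second != ((first << 1) & MASK)
```

The stuck bit corrupts a different bit of the true sum in each pass, so the
two results disagree whenever the fault affects either pass.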

A more general approach than hardware or time redundancy is that of
algorithm-based fault-tolerance (information redundancy) [48]. The central
idea  in  this  technique  is  that  the  data  are  encoded  at  the  system  level  in  the 
form  of  some  error-correcting  or  error-detecting  code,  and  the  algorithms  are 
designed  to  operate  on  encoded  input  data  and  produce  encoded  output  data. 
The  result  is  real-time  error  detection  without  a  duplication  of  arithmetic 
processors or a doubled processing cycle time. The primary entrants in this
category are the low-cost residue and inverse residue codes [49,50,51], the
checksum code [48,52,53], and the weighted checksum code [54].

5.3.1  Residue  Codes 

Residue encoding is based on finding the remainder of a sum of operand
digits evaluated modulo N, where N is a predetermined base. The binary
operand is broken into sections of a bits; each section is considered a
digit. The base is found from the digit size: $N = 2^a - 1$. For a k-bit
number, all k/a digits are added together, and the sum is evaluated mod N.
The remainder of this calculation is the residue code for the operand. The
simple residue code will detect a fault in one bit, even after a repetitive
calculation like multiplication. It will also detect an error if up to a
consecutive bits are faulty [49].

Avizienis devised a scheme in which two or more residue digits are used
to detect and then locate an error [49]. Furthermore, the digits also check
each other: if only one residue digit indicates an error, then that residue is
incorrect; only if both show an error will a fault in the number be corrected.

5.3.1.1 Signed-Digit Residue Code

The residue code has been extended to include numbers expressed in
signed-digit number representation [50]. When a single digit of an SDNR
number is faulty, any number of (not necessarily consecutive) bits within that
digit may be in error, and the fault will still be detected. The only
exceptions to this rule are errors which add or subtract the base N from the
digit. For example, if a = 4 (so that N = 15), and the signed digit is changed
from (9) to (-6), the error will go undetected. However, only a very specific
change in the bit pattern will camouflage the fault, so detection is highly
probable. Furthermore, if the number is encoded in signed binary number
representation, more bits must be changed, and they must be changed to
specific values of (1, 0, -1) to hide the error. The combination of SBNR and
residue encoding thus appears to have great fault-tolerance potential.
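
The undetected case in the example above can be reproduced with a few lines.
The sketch assumes a signed-digit number of radix 2**a, so that every digit
weight is congruent to 1 modulo N and the residue is the digit sum modulo N;
the function name is hypothetical.

```python
def sd_residue(digits, a):
    """Residue modulo N = 2**a - 1 of a list of signed digits (radix 2**a)."""
    n = (1 << a) - 1
    return sum(digits) % n


def escapes_detection(good_digit, bad_digit, a):
    """True when replacing good_digit by bad_digit leaves the residue unchanged."""
    n = (1 << a) - 1
    return (good_digit - bad_digit) % n == 0
```

With a = 4, replacing 9 by -6 shifts the digit by exactly N = 15 and is
missed, while any other erroneous value in the range -10 to 10 is caught.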

5.3.2 Checksum Codes

Unlike the research into residue codes, Abraham's studies of checksums
have been directed specifically at matrix encoding [48]. The checksum
approach attaches one or more checksums to the end of a row or column of a
matrix. These numbers then participate in all calculations as if they were
just data. The net effect on a systolic processor is simply an increase in
the size of each dimension by one or two rows. No special algorithms are
needed to take care of the error codes. Unfortunately, checksum coding was
introduced in the context of floating point computers. Fixed-point
calculations (like those prevalent in high-speed dedicated signal processors)
require a slightly more difficult coding scheme, because a full-precision
checksum would overflow a fixed-point system. In situations where the
implementation depends on the number system, this report assumes fixed-point
operations.

5.3.2.1 Unweighted Checksums

In  the  simplest  checksum  code,  an  unweighted  checksum  is  formed  by 
adding  together  all  elements  in  a  row  or  column  of  a  matrix.  Overflow  bits 
from  this  addition  are  ignored.  Depending  on  the  applications,  just  row  or 
column  encoding  may  be  sufficient,  or  both  may  be  needed.  The  unweighted 
checksum  will  detect  a  single  error  in  the  row  or  column.  It  is  effective  in 
LU  decomposition,  matrix  inversion,  matrix-vector  multiplication,  matrix- 
scalar  multiplication  and  singular  value  decomposition  [48]  [53]. 
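
For matrix multiplication, the unweighted scheme can be sketched as below: a
column-checksum matrix times a row-checksum matrix yields a product that
carries both kinds of checksum. The 16-bit word (overflow ignored by reducing
modulo 2**16) and the helper names are assumptions for illustration.

```python
MOD = 1 << 16      # fixed-point word; overflow bits are ignored


def matmul(a, b):
    """Matrix product with every element reduced modulo MOD."""
    return [[sum(x * y for x, y in zip(row, col)) % MOD
             for col in zip(*b)] for row in a]


def add_column_checksum(a):
    """Append a row holding the sum of each column."""
    return a + [[sum(col) % MOD for col in zip(*a)]]


def add_row_checksum(b):
    """Append a column holding the sum of each row."""
    return [row + [sum(row) % MOD] for row in b]


def inconsistent(c_full):
    """Return (bad_rows, bad_cols) of a full-checksum matrix."""
    bad_rows = [i for i, row in enumerate(c_full)
                if sum(row[:-1]) % MOD != row[-1]]
    bad_cols = [j for j, col in enumerate(zip(*c_full))
                if sum(col[:-1]) % MOD != col[-1]]
    return bad_rows, bad_cols
```

The checksum row and column pass through the multiplication as ordinary data.
A single wrong element of the product shows up as exactly one inconsistent row
and one inconsistent column.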

5.3.2.2 Weighted Checksums

To achieve error correcting capability, the checksum must positionally
weight the addends. The result is the weighted checksum [54]. In this system
each element of a vector within the matrix is multiplied by a different
weighting factor before being added to form the checksum. The simplest
weighting scheme consists of powers of 2: for example, the element e(i) would
be multiplied by the weight $2^i$ (left-shifted i bit positions). For a
fixed-point system with numbers of length k bits, the sum would quickly
overflow, so it is added modulo a specific base. Unlike the residue code, the
base for weighted checksums is the largest prime number less than $2^{k+1}$.
For 16-bit systems, this number is 131 059, and for 32-bit, it is 8 589 934 583.

To  allow  correction  of  errors,  the  weighted  checksum  vector  must  be 
augmented  with  a  vector  of  unweighted  checksums.  Thus,  as  with  residue 
encoding,  if  one  checksum  detects  an  error,  the  checksum  is  incorrect;  if  they 
both  do,  the  error  in  the  data  may  be  located  and  corrected.  The  weighted 
checksums  technique  can  correct  errors  in  matrix  multiplication  with  a  matrix, 
vector  or  scalar,  matrix  inversion,  and  LU  decomposition  (by  Gaussian 
elimination).
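
Locating and correcting a single error with the pair of checksums can be
sketched for one vector as follows. The modulus P = 65537 is an illustrative
prime chosen for this sketch, not the base quoted in the text, and the
function names are hypothetical. With this P the weights 2**i are distinct
for vectors of up to 32 elements, and data values are assumed to lie in
[0, P).

```python
P = 65537   # illustrative prime modulus (an assumption of this sketch)


def encode(data):
    """Return (unweighted, weighted) checksums of a list of ints in [0, P)."""
    unweighted = sum(data) % P
    weighted = sum(pow(2, i, P) * d for i, d in enumerate(data)) % P
    return unweighted, weighted


def check_and_correct(data, unweighted, weighted):
    """Repair a single corrupted element of data in place.

    Returns the index corrected, or None when the data were left alone.
    """
    s1 = (sum(data) - unweighted) % P
    s2 = (sum(pow(2, i, P) * d for i, d in enumerate(data)) - weighted) % P
    if s1 == 0 or s2 == 0:
        # Both zero: consistent. Only one nonzero: suspect that checksum.
        return None
    for j in range(len(data)):
        if (pow(2, j, P) * s1) % P == s2:   # the weight identifies the position
            data[j] = (data[j] - s1) % P
            return j
    return None                             # not a single-element error
```

The unweighted syndrome gives the size of the error and the ratio of the two
syndromes gives its position, which is why both checksum vectors are needed
for correction.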

5.3.3  Comparison  of  Fault-Tolerant  Implementations 

Computing any of the error-detecting codes described requires adder
trees to sum the digits or elements. In residue encoding, an end-around-carry
is generated within the adders. In checksum encoding, the overflow carries
are thrown away. In weighted checksum encoding, each level of addition is
performed modulo the prime base. Thus, for residue and unweighted checksum,
the areas are almost the same for a length-n column of additions, $O(n)$, and
the add time is identical, $O(\log_2 n)$. For weighted checksum, each adder
must compare its sum to the base, and subtract the base if necessary. The
adder can also be used for this subtraction, so the area remains $O(n)$, but
the double add cycle means that the relative time is $O(2 \log_2 n)$.

In an n x n matrix with row and column encoding, the area-time complexity
for the first two cases is $O(n^2 \log_2 n)$. For the weighted checksums, it
is $O(2 n^2 \log_2 n)$. Keep in mind that for error correcting capability, the
A-T product of the residue is doubled, while those of the weighted and
unweighted checksums are added together.

Another consideration is the ease with which algorithms may be adapted
to  allow  fault-tolerance.  In  the  case  of  residue  encoding,  the  residue  digits 
must  be  handled  separately,  which  increases  algorithm  complexity.  The 
checksums  system,  however,  merely  increases  slightly  the  size  of  the  input  to 
the  algorithm,  with  no  special  treatment  given  the  checksums  themselves. 


Figure  4.  The  Fault-Tolerant  Line  Hypergraph  [34] 

5.4  Systolic  Array  PE  Study 

Systolic arrays may be configured as shown in Figure 5. They include
rectangular, hexagonal, or linear systolic arrays. The most likely use for
each configuration is indicated. In the application domain of beamformers for
towed arrays (for example), many researchers suggest that a triangular array
is preferable. Unfortunately, all currently available systolic arrays,
including the NCR GAPP (Geometric Arithmetic Parallel Processor) with its
72 PE's, are configurable in rectilinear (6 x 12 units), not triangular,
fashion. Hence, a triangular array configuration, although optimal from an
algorithmic standpoint (e.g., recursive LS), does not
efficiently utilize commercial arrays.

Figure 5. Systolic Array Solutions (panels include matrix-vector
multiplication and solution of triangular linear systems)

Our methodology was to map a class of algorithms onto a machine that does
not yet exist. To do so, we must first specify the design constraints such as
circuit switching speed, propagation delay, throughput, maximum number of
gates, etc. Next, we must bound the algorithm classes. Within each class we
should first determine the greatest common denominator or building block. In
adaptive signal processors, the inner product processor appears to be a
suitable common denominator and starting point. Our strategy then is to
derive near optimal algorithms invoking this basic signal processing
operation, and then map the fast algorithms onto new VLSI circuits. Of
course, not every algorithm may be based on this sole operation.

The  methodology  is  a  two-step  process.  In  the  first  step,  we  want  to 
obtain  fast  adaptive  signal  processing  algorithms.  Here  we  will  bound  the 
problem to study recursive and non-recursive adaptive algorithms. During this
step, we will be sensitive to the computational processes which are expensive,
such as matrix manipulations. Realizing that recursive algorithms contain
basic computational tasks identical to those of non-recursive algorithms, we
will begin with non-recursive algorithms. Here, we want to identify inherent
parallelism  possible  with  adaptive  signal  processing  algorithms. 

The inherent parallel nature of an algorithm is then displayed by
mapping the initial adaptive algorithm into a sequential set of tasks
(commonly called "straight-line" algorithms) to be represented by a directed
acyclic flow graph (DAG), each node being a task (multiply, divide, etc.) and
each edge representing a data dependency relation. That is, briefly,
predecessor nodes compute data needed by their successor nodes. From this
graphical setting, we can reduce the longest or critical path by hand (if
obvious) or by computer (using well-known graph reduction algorithms, cf.
Chapter 6 of [55]). Any concurrency so identified will provide us with
speed-up via parallelism. One method to obtain concurrency is to use the
adjacency matrix of the flowgraph to compute the earliest and latest
precedence relationships, and then map these onto a resource matrix (machine
environment such as number of adders, subtracters, multipliers, convolvers,
etc.) to identify concurrency. Another method is the divide-and-conquer
scheme proven successful in polynomial multiplication.
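
The earliest and latest precedence computation can be sketched as a forward
and a backward pass over the flow graph. For brevity the sketch stores
predecessor lists rather than an adjacency matrix; the task names, durations,
and function name are hypothetical, and the graph is assumed to be acyclic.

```python
def schedule(durations, deps):
    """durations: {task: time}; deps: {task: [predecessor tasks]}.

    Returns (earliest, latest, length). Tasks whose earliest and latest
    start times coincide lie on a critical path of the given length.
    """
    order, seen = [], set()

    def visit(task):                     # depth-first topological sort
        if task in seen:
            return
        seen.add(task)
        for p in deps.get(task, []):
            visit(p)
        order.append(task)

    for task in durations:
        visit(task)

    earliest = {}
    for task in order:                   # forward pass
        earliest[task] = max(
            (earliest[p] + durations[p] for p in deps.get(task, [])),
            default=0)
    length = max(earliest[t] + durations[t] for t in order)

    latest = {t: length - durations[t] for t in order}
    for task in reversed(order):         # backward pass
        for p in deps.get(task, []):
            latest[p] = min(latest[p], latest[task] - durations[p])
    return earliest, latest, length
```

Tasks with slack (latest start later than earliest start) can share a
functional unit with critical tasks, which is the information needed when
mapping onto the resource matrix.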

To  date,  the  complexity  of  most  signal  processing  algorithms  has  been 
estimated  from  their  number  of  multiplications  and  sometimes  from  their  number 
of additions. This is not always prudent. In fact, we should say an
algorithm is deemed to be efficient if its final implemented form takes
minimal time. The execution time consists of data shuffling operations as
well as arithmetic operations. Hence, an algorithm with fewer mathematical
operations may not always be the best in its final implemented form.
Fast algorithms identified in this step will most likely be modified later to
ensure optimal implementation. However, these initial results will serve as a
good  starting  point.  The  best  approach  seems  to  be  to  first  design  an 
algorithm  which  is  efficient  in  terms  of  the  number  of  mathematical 
operations,  and  then  modify  it  to  take  full  advantage  of  VLSI  characteristics. 

As  Lamagna  [56]  has  pointed  out,  "The  straight-line  algorithm  paradigm 
neglects  the  cost  of  the  overhead  associated  with  loop  control  and  testing 
operations, as well as the time required to fetch and store information inside
a computer's memory. These costs can vary greatly from computer to computer
and will not even be the same for two programming language compilers
implemented on the same machine. Fortunately, the overall times of the
algorithms  studied  are  driven  primarily  by  the  underlying  structure  of  the 
arithmetic  operations  performed,  rather  than  such  overhead  considerations,  so 
the  results  obtained  are  generally  accurate  to  within  a  small  constant  factor 
for  actual  implementations." 

A  second  step  is  to  organize  the  VLSI  for  the  fast  adaptive  algorithms. 
One  basic  building  block  is  the  inner  product  processor.  Rectangular  and 
hexagonal  geometries  can  be  incorporated.  We  intend  to  organize  the  inner 
product  processor  as  efficiently  as  possible  in  array  structures  in  order  to 
capture  the  inherent  pipelining,  parallelism,  and  recursive  nature  of  the 
adaptive  signal  algorithms.  We  anticipate  that  cyclic  convolution  and  a 
cyclic  convolver  may  be  quite  beneficial  in  casting  the  algorithms  into  VLSI. 

The  basic  multiplication  step  itself  was  examined.  The  employment  of 
distributed  arithmetic  implementation  [30,57,58]  successful  for  fixed-point 
digital  filters  was  evaluated  on  an  area  x  time  basis  for  adaptive  algorithms. 
Because  adaptive  algorithms,  like  filter  algorithms,  are  essentially  finite 
state  machines,  the  multiplication  and  addition  steps  can  be  replaced  by  a 
partial  table  look-up  of  precalculated  products.  The  analysis  is,  then,  a 
trade-off  between  multipliers  and  memory  space  on  the  chip.  This  comparative 
analysis is not trivial since the recursive nature of IIR adaptive algorithms
forces us to compute an entire result before reloading data registers
(temporary scratchpad space) to generate the next table look-up entry.
However, some pipelining is possible and can be exploited.
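
The table look-up idea can be sketched for a fixed-coefficient inner product
in the style of distributed arithmetic. The table holds every partial sum of
the coefficients, and the inputs are consumed one bit plane at a time. The
two's-complement word format and the function names are assumptions of this
sketch, which covers fixed coefficients only and not the adaptive case.

```python
def build_table(coeffs):
    """table[m] = sum of coeffs[k] over the set bits k of address m."""
    table = [0] * (1 << len(coeffs))
    for m in range(1, len(table)):
        low = m & -m                           # lowest set bit of m
        table[m] = table[m ^ low] + coeffs[low.bit_length() - 1]
    return table


def da_inner_product(table, xs, bits):
    """Inner product of the table's coefficients with two's-complement xs.

    Each x must satisfy -2**(bits-1) <= x < 2**(bits-1).
    """
    acc = 0
    for j in range(bits):
        addr = 0
        for k, x in enumerate(xs):             # gather bit j of every input
            addr |= ((x >> j) & 1) << k
        term = table[addr] << j
        # The sign bit plane carries negative weight in two's complement.
        acc += -term if j == bits - 1 else term
    return acc
```

No multiplier is used: each bit plane costs one table read and one shifted
add, and the table grows as 2**K for K coefficients, which is the multiplier
versus memory trade-off noted above.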

A  tentative  method  to  wire  up  the  algorithms  is  to  use  the  "evaluation- 
interpolation"  method  successfully  employed  in  [59]  to  obtain  (area  X  time) 
optimal  convolvers  by  observing  the  necessary  algebraic  steps  and  polynomial 
evaluations  that  can  be  cast  directly  into  a  parallel  computational  process. 
These  algebraic  steps,  as  organized,  nicely  prescribe  optimal  VLSI  structures. 
Computational  tasks  can  be  divided  into  those  circuits  which  are  amenable  to 
regular  and  simple  interconnections  and  those  which  are  not.  Matrix 
multiplication  tasks  obviously  can  be  regularized.  ADC,  DAC,  and  other  analog 
computational  tasks  are  not  amenable  to  regular  structures  as  we  presently 
know  them.  The  control  circuits  (such  as  found  in  firmware-oriented 
architectures)  are  amenable  to  regular  implementations. 

5.5  Error  Tolerant  Design  with  Multi-Valued  Logic  (MVL)  Circuits 

In  this  research,  the  effectiveness  of  MVL  circuits  realizing  signed 
binary  number  arithmetic  must  be  considered  with  respect  to  the  inherent 
fault-tolerance of MVL circuits. Polylogic circuits, of which MVL is
one case, have been studied by Porter [60] for intrinsic error tolerance. Our
work plan is to use his technique to prove out low rejection rate and/or
reduced component stringency requirements. Here, logic circuit failures (such
as "stuck-at faults") and the effects of resistor, capacitor, and inductor
error values (necessary for hybrid signal processors which incorporate ADC's
and DAC's) should be studied. Note that small fluctuations in component
values may not be a relevant consideration for binary circuits; however, they
are quite relevant and natural in MVL. Polylogic families include binary,
multi-value, and threshold logic.


The procedure is to define a finite alphabet (R) and identify the set of
all possible values of switching tuplets ($R^n$). Then, multilinear mappings
onto the finite R are sought which eventually produce polynomic realizations
of the desired switching function. The major question is, "Does any mapping
exist, if component errors are modeled and included in our polylogic MVL
subset?" Porter has already shown that such mappings exist and, in fact,
several do. Hence, a circuit designer can choose among the better
circuits, performing engineering trade-offs as needed. This is the
flexibility possible in this research to obtain fault-tolerant MVL circuits
that are optimal in the sense of circuit complexity, power, and speed.

5.6  Design  for  Testability 

Efficient test generation for logic circuits is a matter of prime
importance [61-63]. Yet most major fault detection problems are known to be
NP-complete in general. MVL is no exception. Hence, design for testability is
necessary. Built-In-Test (BIT) circuits are highly desired. The study by
Fujiwara and Toida [64] can be used to compare our fault-tolerant "testable"
designs with their benchmark complexities. They also provided clever
procedures to insert a few additional test-points into an arbitrary circuit to
make it easily testable. Heavy use of PLA's is made. Their studies show that
some circuits (linear circuits, decoders, parallel adders, ...) can be
"tested" in polynomial time.

5.7  Redundancy  for  Increased  Yield 

Typically,  for  a  new  and  complex  device,  most  of  the  chips  from  a 
manufacturing  batch  contain  defects.  The  yield  is  quite  low.  This  can  affect 
both  cost  and  reliability. 

During the fabrication process, defects that can result in faults can
occur at any time during processing. For a chip of N circuits, there will be
$L_i$ fault-causing defects introduced during fabrication process step $i$. In
all we can expect to find $L = \sum_i L_i$ fault-causing defects. If all these
defects are Poisson distributed, then the yield will be given by

$Y = e^{-L}$ (14)

assuming  random  point  defects  are  our  only  yield  detractors.  In  general, 
fault-causing  defects  will  not  be  randomly  distributed  but  will  be  clustered. 
Furthermore,  the  clustering  nature  will  vary  from  step  to  step.  Nonetheless, 
the  assumption  of  gamma  distributed  defects,  where  the  same  clustering 
parameter,  a,  characterizes  all  tne  defects,  leads  to  the  following  yield 
formula  (15)  that  has  been  successfully  used  to  model  a  large  body  of  data. 

Y  «  0  ♦  Va)"S  (15) 

where  L^  is  the  average  number  of  fault-causing  defects  per  chip.  In  the 
limit  that  a — >  00  ,  (15)  reduces  to  (14).  In  actual  situations,  a  is 
typically  in  the  range  1/2  -  4,  and  the  yield  can  be  appreciably  better  than 
predicted  by  (14).  In  tne  case  of  redundant  designs  as  may  be  required  for 
Wafer  Scale  Integration  (WSl),  the  calculation  of  yield  becomes  more  complex, 
and  the  role  of  clustering  and  correlation  of  defects  becomes  even  more 
important . 
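
The two yield formulas can be compared numerically. The following is a
minimal sketch, assuming the readings Y = exp(-L) for (14) and
Y = (1 + L/alpha)^(-alpha) for (15); the function names and the sample value
L = 2 are illustrative assumptions and do not come from the original report.

```python
import math


def poisson_yield(avg_defects: float) -> float:
    """Equation (14): yield for Poisson-distributed point defects."""
    return math.exp(-avg_defects)


def clustered_yield(avg_defects: float, alpha: float) -> float:
    """Equation (15): yield for gamma-distributed (clustered) defects."""
    return (1.0 + avg_defects / alpha) ** (-alpha)


if __name__ == "__main__":
    L = 2.0  # assumed average number of fault-causing defects per chip
    print(f"Poisson model:      {poisson_yield(L):.4f}")
    for alpha in (0.5, 1.0, 4.0, 1e6):
        # Small alpha means strong clustering; large alpha approaches (14).
        print(f"alpha = {alpha:<9g} {clustered_yield(L, alpha):.4f}")
```

For a fixed L, stronger clustering (smaller alpha) concentrates the defects
on fewer chips, which is why the clustered model predicts a higher yield than
the Poisson model.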

Yield projections are a primary consideration in wafer scale
integration. Ketchen [65] develops a point defect yield model for a two-way
redundancy scheme appropriate for random logic. The model assumes that the
fault-causing defects are randomly distributed locally but that the defect
density can vary across a wafer as well as from one wafer to another. The
importance of the distinction between on-wafer and wafer-to-wafer variations
in defect density is demonstrated. This model demonstrates the dependence of
yield on the nature of the defects, and, together with gross yield estimates
and the appropriate nonredundant yield factor, it will serve as a good
starting point to model actual yield data. The existence of complex local
correlations and some non-point-like defects will clearly complicate matters,
although, in many cases, a perturbative approach is adequate to model the
situation.

Redundancy  can  be  used  to  improve  the  yield  significantly.  Such  methods 
are  commonly  used  for  memory  chips.  Faulty  components  are  left  out  of  final 
interconnection.  The  strategies  used  include  eliminating  affected  row  and 
column,  or  eliminating  the  affected  half.  A  processor  array  can  be 
reconfigured  in  more  complex  arrays  [66-68].  To  obtain  an  array  of  specified 
dimensions,  one  would  then  start  with  a  larger  and  thus  redundant  array. 
Redundancy  does  not  always  increase  the  yield,  because  the  larger  chip  area 
required  tends  to  decrease  the  yield.  Using  Koren  and  Breuer's  approach  [66], 
expressions  for  yields  for  both  simple  and  fault-tolerant  arrays  can  be 
obtained,  and  optimal  designs  which  maximize  yield  can  be  obtained.  Faults 
affecting  both  PE's  and  interconnections  have  to  be  considered. 
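
The trade-off between spare PE's and added area can be illustrated with a
deliberately simple model. This sketch is not the Koren and Breuer analysis
[66]; it assumes independent PE faults, ideal reconfiguration, and a
non-redundant support area (switches and interconnections) that grows with
every fabricated PE. All names and parameter values are hypothetical.

```python
from math import comb


def array_yield(n_needed: int, n_spares: int, pe_yield: float,
                overhead_yield_per_pe: float) -> float:
    """Yield of an array that needs n_needed good PEs out of n_needed + n_spares."""
    total = n_needed + n_spares
    # Probability that at least n_needed of the fabricated PEs are fault-free.
    p_enough = sum(
        comb(total, good) * pe_yield ** good * (1.0 - pe_yield) ** (total - good)
        for good in range(n_needed, total + 1)
    )
    # The support circuitry is not redundant, so all of it must be fault-free.
    return overhead_yield_per_pe ** total * p_enough


if __name__ == "__main__":
    for spares in (0, 2, 4, 8, 32, 100):
        print(spares, round(array_yield(16, spares, 0.9, 0.995), 4))
```

Under these assumptions a few spares raise the yield sharply, while a large
number of spares lowers it again because the extra support area dominates.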

5.8  Fault-Tolerance  for  Higher  Reliability 

Dynamic  reconfiguration  can  be  used  to  overcome  hard  faults  occurring  in 
the  field.  Before  any  PE's  are  removed  from  active  configuration,  it  would  be 
necessary  to  detect  such  failures.  To  control  error  propagation,  such 
detection  should  have  low  latency  (time  between  error  occurrence  and  external 
fault  manifestation).  For  concurrent  testing,  duplication  is  the  most 
effective  technique;  however,  it  can  significantly  reduce  yield.  It  has  been 
shown  that  some  self-checking  with  limited  redundancy  can  significantly 
improve  yield  [66]. 

Dynamic  reconfiguration  in  processor  arrays  can  be  done  in  different 
ways  [67].  Because  reconfiguration  results  in  fewer  PE's,  there  is  some 
degradation in performance. The effectiveness of different schemes depends on
the efficiency of partitioning algorithms for execution on reduced-size arrays.
Trade-offs between reconfiguration strategies must consider optimizing
reliability, coverage, performability [69], and computational availability.

Processor arrays can support a special form of fault-tolerance. In
real-time applications, successive data points can exhibit considerable
correlation. A sudden and significant change in a data point may suggest the
onset of a soft (temporary) or hard fault. Correctness of the value can be
confirmed by recomputation (which is a form of time redundancy); however, the
same faulty hardware, because of a hard or a long soft fault, is likely to
generate the same incorrect result. In a processor array, however,
recomputation can be done by mapping the process to a shifted set of PE's. In
this case, a faulty PE will almost always generate a different result.

A considerable degree of fault-tolerance can be achieved by encoding
information. Kuang and Abraham have described a scheme for matrix
multiplication with processor arrays which requires only limited hardware and
time redundancy [52]. Suitable SDNR arrays are available.

5.9  Hard/Soft  Errors 

Reliability  with  respect  to  hard  and  soft  faults  will  be  considered 
separately.  While  methods  exist  which  use  the  same  measure  to  include  effects 
of  both  types  of  faults,  such  a  measure  can  be  hard  to  interpret. 

For binary devices, the failure rate is generally estimated by using
techniques in MIL-HDBK-217 and its updates. The fact that the learning and
the quality factors alone can change the result drastically suggests that
exact results cannot be expected. By characterizing L (see Section 5.1.1)
itself by a statistical distribution, these limitations can be taken into
account.


It can be expected that the failure rate data for binary devices are not
directly applicable to ternary devices. Physical degradation that would
not cause a logical failure in a binary device may cause a failure in a
ternary device. On the other hand, a ternary device uses fewer logical nodes,
interconnections, and specialty pins, which can significantly enhance
reliability. How the available data on failure rates for binary devices can
be adapted for ternary devices is a problem to be examined.

Alpha particles have been a major cause of soft failures. However,
they can now be combated very effectively by proper choice of encapsulating
material and by coating. Also, the new CHMOS technology is remarkably robust
against alpha particles. Various types of noise [70] remain a problem. Here
the reduced noise immunity of MVL makes noise an important consideration. Soft
failure rates due to such causes can be estimated satisfactorily, but the
assumptions remain to be examined.


5.10 Design-For-Testability and Built-In-Self-Testing


Efficient  test  generation  for  logic  circuits  is  now  recognized  to  be  a 
matter  of  prime  importance  [61-65].  Yet  most  major  fault  detection  problems 
are  generally  NP-complete.  The  proposed  MVL  PE  array  is  no  exception. 


In a regular array, there are two major testing considerations. One is
how to test a single PE, assuming its inputs and outputs can be
directly accessed. The other is how to exercise each PE
when the PE's form a regular array. Some arrays possess a special feature
called C-testability [71]. A C-testable array can be tested using a fixed
number of tests, regardless of the dimensions of the array. It has been shown
that arrays which are not C-testable can often be made so by using only minor
modifications [72].


Several scan-path techniques like LSSD have been suggested. These
reduce the problems of testing sequential circuits to that of testing purely
combinational circuits. This enormously simplifies test-pattern generation.
The scan-path techniques are also applicable for PE arrays. An implementation
has been described for a CMOS, two's complement serial convolver chip [73].
Applicability of Ternary-Scan-Design, as proposed in [74], for our proposed
scheme is relevant.

The study of Fujiwara and Toida [64] compares fault-tolerant "testable"
designs with benchmark complexities. They also provide clever procedures to
insert a few additional test points into an arbitrary circuit to make it
easily testable. Heavy use of PLA's is made. Their studies show that some
circuits (linear circuits, decoders, parallel adders, ...) can be "tested" in
polynomial time.

Built-In-Self-Test (BIST) circuitry allows a device to test itself without
using expensive test equipment. It is also valuable for assuring device
integrity in the field. For a PE array, BIST must be incorporated within each
element. Circuits are also needed to support BIST globally, so that the
interconnections are tested and the go/no-go information is routed to some
external output or outputs.

5.11  Information  Redundancy 

Low-cost residue and inverse residue codes for error detection in
signed-digit arithmetic were proposed for this project. These codes
capitalize on the fact that they can be used to check storage, transmission,
and computing functions using the same checking algorithms. These algorithms
compute the modulo residue of messages, operands, and results in a serial
or parallel fashion. The residue digits are then tested to indicate whether
or not an error exists [50].

As noted by Avizienis [50], the effectiveness of Signed Digit Residue
Codes (SDRC) can be assessed by observing that undetectable errors are caused
only by faults that change the value of the signed-digit number by a multiple
of 2^b - 1 (where 2^b is the radix). Such changes are highly unlikely. A
detailed study of effectiveness requires full knowledge of the internal
representation of digit values and an analysis of the effects of repeated-use
faults when they may affect the operands or the result.
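
The residue check can be sketched in a few lines. This is an illustration
under stated assumptions, not the algorithm of [50]: digits are held most
significant first, the radix is 2^4 = 16, and the check modulus is
radix - 1 = 15. The function names and the example operands are hypothetical.

```python
def sd_value(digits, radix):
    """Value of a signed-digit number, most-significant digit first."""
    value = 0
    for d in digits:
        value = value * radix + d
    return value


def sd_residue(digits, radix):
    """Residue modulo (radix - 1).

    Since radix is congruent to 1 modulo (radix - 1), the residue of the
    number equals the residue of its digit sum.
    """
    return sum(digits) % (radix - 1)


def sum_passes_check(x, y, result, radix):
    """True if the residues of x and y add up to the residue of result."""
    modulus = radix - 1
    expected = (sd_residue(x, radix) + sd_residue(y, radix)) % modulus
    return expected == sd_residue(result, radix)


if __name__ == "__main__":
    x, y = [3, -5, 7], [1, 6, -2]                         # values 695 and 350
    print(sum_passes_check(x, y, [4, 1, 5], 16))  # correct sum, 1045
    print(sum_passes_check(x, y, [4, 1, 6], 16))  # off by 1: detected
    print(sum_passes_check(x, y, [4, 2, 4], 16))  # off by 15: not detected
```

The last case shows the undetectable class named in the text: an error that
changes the value by a multiple of 2^b - 1 leaves the residue unchanged.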

The algorithms proposed in [50] employ only one residue digit for an
entire k-digit SD operand. While this minimizes the cost of encoding, it may
be inconvenient in variable-precision operations that generate the
most-significant digits of the results first and that are "chained", executing
further operations on high-significance digits of an intermediate result X
even before the lower-significance digits become available. The Serial
Checking Algorithm is completed only after all digits of X have been obtained;
the residue digit of X is then computed and compared to test for the presence
of an error.

An  error  indication  requires  the  cancellation  of  all  results  that  have 
used  at  least  one  digit  of  X.  The  cancellation  must  reach  k+3  digit  levels 
downstream  in  the  chain  and  identify  all  potentially  erroneous  results.  Two 
solutions may be applied to shorten the "span" of the cancellation that must
follow an error indication: (a) the segmentation of operands into check
segments,  and  (b)  single-digit  encoding  that  employs  a  checking  element 
within  each  arithmetic  unit  that  performs  single-digit  operations. 


Segmentation divides the k-digit operand into check segments of p
digits length each and attaches one residue digit to the right end of each
check segment, rather than using one residue digit at the end of the entire
operand. The cancellation span is reduced to p+3 digit levels downstream in
the chain. Furthermore, error detection effectiveness in the case of
repeated-use faults may be increased because of the shorter length of the
segment being checked. The cost of segmentation consists of the extra time
and storage required by the proliferation of residue digits.

Single-Digit Encoding appears most suitable for VLSI-implemented
arithmetic units that can execute the algorithms for digits of S-D
representations with relatively large radices, such as r = 2^b or r = 10^c,
or even greater values. Here each individual S-D representation digit carries
its residue digit modulo (r' - 1), where

r' = 2^q when r = 2^b, and b = kq (q \geq 2) (16)

r' = 10^q when r = 10^c, and c = kq (q \geq 1) (17)

The single-digit encoding approach is an extension of the segmentation
concept. Each digit of the radix r = 2^b is treated similarly to a k-digit-long
segment of the radix 2^q representation that is checked by one modulo
(2^q - 1) residue digit. The evident advantage of this approach is the
pinpointing of the error to the single arithmetic unit.

5.12  Hardware  Redundancy 

Some  drawbacks  of  arithmetic  codes  are  their  inability  to  detect  errors 
in  logical  operations,  and  single  errors  in  group  carry-lookahead  structures 
[47].  The  latter  is  not  a  problem  if  SBNR  is  used.  Thus,  hardware  redundancy 
has  been  recognized  as  the  most  effective  technique  to  identify  faults  in 
logical  operations.  In  [47],  Patel  and  Fung  describe  a  technique  in  which 
coding  and  decoding  functions  (in  the  form  of  shift  left  and  shift  right)  are 
employed. Here, the arithmetic/logic operation is performed twice. The first
time it is performed without shifting, and the result is stored in a general
register. The second time, the operands are shifted before the operation, the
inverse shift is applied to the result, and this result is then compared with
the contents of the storage register. A mismatch indicates an error in
computation.
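
The recomputation scheme just described can be sketched as follows. The
8-bit operand width, the simple adder model, and the single stuck-at-0 output
bit are illustrative assumptions; they are not taken from [47].

```python
WIDTH = 8  # assumed operand width; the unit is wider to hold shifted sums


def faulty_add(x, y, stuck_bit=None):
    """Adder model that can force one output bit to 0 (stuck-at-0)."""
    result = (x + y) & ((1 << (WIDTH + 2)) - 1)
    if stuck_bit is not None:
        result &= ~(1 << stuck_bit)
    return result


def checked_add(x, y, stuck_bit=None):
    """Return (result, error_flag) using recomputation with shifted operands."""
    first = faulty_add(x, y, stuck_bit)                   # unshifted pass
    second = faulty_add(x << 1, y << 1, stuck_bit) >> 1   # shifted pass, undone
    return first, first != second


if __name__ == "__main__":
    print(checked_add(5, 6))               # fault-free unit
    print(checked_add(5, 6, stuck_bit=1))  # bit 1 of the output stuck at 0
```

Because the shift moves every operand bit to a different bit slice, a faulty
slice corrupts different result bits in the two passes, so the comparison
exposes the fault whenever the first result is wrong.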

The hardware redundancy technique described has been implemented in
binary number systems. Nevertheless, it is reasonable to assume that it, or
any other binary technique, can be adapted to SBNR architectures. Of course, a
trade-off study of cost versus circuit complexity should also be completed.

The binary fault-tolerant ALU implemented by Patel and Fung can be
constructed using a CMOS family of ternary logic circuits. These circuits,
proposed by Mouftah and Heung [75], use two power supplies, each below the
transistors' threshold voltage, and do not include resistors. All transistors
are 5 µm x 5 µm. The threshold voltages for the p-channel and n-channel
enhancement-type transistors are -1 V and +1 V. They have opposite
polarity for the depletion-type devices. With the use of voltage power
supplies below the transistors' turn-on voltage and the exclusion of resistors,
it is possible to implement this circuitry in VLSI. Added features include
low power consumption, high speed, and comparable performance to binary
counterparts.

For the ALU proposed in [47] to be fault-tolerant, the encoder, decoder,
and comparator circuitry must be Totally Self-Checking (TSC). They can be
implemented with PLA's. One advantage of using PLA's is that their regular
structure simplifies analysis of the effects of faults on their output and
therefore facilitates test vector generation and determination of fault
coverage.

The  most  elementary  fault  model  used  for  PLA's  includes  three  types  of 
faults: 

1.  Stuck-at  faults  on  an  input  line,  product  term  line,  or  output  line. 

2. A short between two adjacent or crossing lines that forces both of them to
the same logic value.

3. A missing or extra crosspoint device in the AND array or in the OR array.

Since breaks in lines (that are not equivalent to stuck-at faults) are
one of the main causes of failures in VLSI circuits [76-77], it is clear that
the above simple fault model does not accurately reflect the possible physical
defects in an MOS PLA. A more complete fault model is given in [78].

6.0  Research  Results  of  Current  Period 

6.1  SBNR  Architectures 

This PI has been investigating the Least-Mean-Square (LMS) adaptive
filter algorithm for signal processors [79-81]. Recently, these studies have
focused on redundant arithmetic implementations in distributed and systolic
array architectures [20,82,83]. It has been discovered that some of the
inherent borrow/carry propagation properties tend to make implementations very
compact and modular. This tends to suggest that fault-tolerant properties
abound for SBNR realizations. As early as the ILLIAC III, Atkins [3] showed
that higher radix implementations (of which SBNR is a reduced yet very
powerful subset) produced superior fault-tolerant arithmetic engines when
using redundant or signed number codes.

The papers in the Appendix by corporate personnel have demonstrated some
of the advantages to SBNR. Note particularly that others are identifying
similar advantages. Sicuranza and Ramponi [84] also exploit memory-oriented
structures properly matching the characteristics of distributed arithmetic for
adaptive nonlinear filters described by truncated discrete Volterra series.
Their use of offset binary code (a form of SBNR) and address splitting
(available to SBNR) established efficient, although dedicated, architectures.
They, as well as we, show that the memory dimension is not (2n)* words
because of the dramatic reductions possible with SBNR and symmetry.

Another promising approach to efficient implementations of redundant
number realizations is described by Owens and Irwin [85]. Here, a primitive
cell, including its operation suite, is used in a DFT application
demonstrating highly regular array structures that achieve good AT^2 bounds.
They partitioned functions into "interface, storage, or arithmetic" to
implement digit-on-line algorithms. We can exploit the same digit-on-line
properties for ultra-fast processing of analog signals. For example,
digital signal processing can begin as soon as the MSB (which is
the first bit) is converted by the ADC. For bit-serial distributed
arithmetic schemes, Linderman [86] has shown how clock rates of tens of MHz
are possible here. Denyer [87], Denyer and Renshaw [88], Jaggernauth et al.
[?2], and others [89] make similar promising discoveries about bit-serial
implementations. Particularly encouraging is the CUSP (digital signal
processor, VLSI) of Linderman et al. [90], since this device is a sixteen
20-bit serial multiplier by 24 serial adder/subtracter, driven by a = 1 MHz
two-phase non-overlapping clock. This device again exemplifies the power of
bit-serial approaches.

6.1.1 Design of SBNR Array Multiplier


The need for high-speed computation has spurred much research into
various forms of parallel processing. The two most common of these
architectures in signal processing applications are the pipelined processor
and the array processor. Developments in parallelism have become quite
popular with the revival of interest in the Signed-Digit Number Representation
(SDNR) characterized by Avizienis [1]. Implementation of the faster
architectures in VLSI is a concern for devices needing powerful processing in
limited space (e.g., mobile, self-contained, and space-based vehicles).

The design described below, and shown in Figure 6, is a systolic array
for matrix multiplication which is compatible with digit online architectures
[91]. This array is similar to one described by Irwin [92] in which two
vectors to be multiplied enter the array most-significant-bit-first, the
distinguishing characteristic of online networks. The array uses the
Processing Element (PE), diagrammed in Figure 7, and shown schematically in
Figures 8 and 9, to perform bitwise vector multiplication. For matrix x
matrix multiplication a parallel multiply/accumulate element may be
substituted for the bit-level PE. In such a large scale system, asynchronous
operation may be faster than the clocked method shown.

If the array is used for vector multiplication, it performs the
operation

[A_1 \; A_2 \; \cdots \; A_m] \, [B_1 \; B_2 \; \cdots \; B_m]^T = C,

where C = A_1 B_1 + A_2 B_2 + \cdots + A_m B_m.

Each vector element A_i or B_i is a word consisting of n signed bits; each
bit may be 1, 0, or -1. When the words are expanded into bit representation,
one can see that the systolic array is actually performing matrix
multiplication on the bits of the vectors. The bits of the answer are found
by adding the entries C_{j,k} of the bit-product matrix along the
lower-left-to-upper-right diagonals.

To see how this happens, refer to Figure 6. On the first clock cycle, A_{1,1}
and B_{1,1} (the MSBs of A_1 and B_1, respectively) are shifted into the bottom
cell, multiplied, and added to C_{1,1} (initially zero). All the other MSB's
are shifted into their first cells. On the next cycle, C_{1,1} shifts up, and
adds to the product of A_{2,2} and B_{2,2}. The products C_{1,2} and C_{2,1}
are formed to the right and left, respectively, of C_{2,2}. The entry C_{2,2}
started just below C_{1,1}. Subsequent clock cycles shift the multiplicands
in, and the answer out in the order shown in Figure 6.


The total time for this calculation is given by

T = (2n + m - 1) t, (21)

where t is the time for each clock cycle. Since this is a digit online
network, calculation is started on the MSB's before the LSB's are needed.
Another significant measure of performance is the time between the entry into
the array of the MSB's, and the exit from the array of the answer MSB. This
time may be calculated by

T_{MSB} = (n + m - 1) t. (22)


In VLSI applications, the number of elements and the number of
interconnections are both significant. These values are given by

number of elements = n^2 - 2n + 2nm - m + 1 (23)

and

number of interconnections = 5n^2 - 18n + 10nm - 9m + 12. (24)

Note  that  all  of  the  equations  are  also  true  for  the  matrix  x  matrix 
multiplication,  in  the  fully  digit  online  case.  If  the  data  are  shifted  in 
parallel,  more  interconnections  will  be  needed. 
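
The timing and size expressions above are easy to tabulate. The sketch below
assumes that n is the word length in bits, that m is the number of vector
elements, and that the leading terms of the element and interconnection
counts are n^2 and 5n^2 (the exponents are not legible in the source). The
function name and the sample sizes are hypothetical.

```python
def array_metrics(n: int, m: int, t: float = 1.0) -> dict:
    """Evaluate the array multiplier expressions for word length n, vector length m."""
    return {
        "total_time": (2 * n + m - 1) * t,   # equation (21)
        "msb_latency": (n + m - 1) * t,      # equation (22)
        # Counts below assume squared leading terms, as noted in the lead-in.
        "elements": n * n - 2 * n + 2 * n * m - m + 1,
        "interconnections": 5 * n * n - 18 * n + 10 * n * m - 9 * m + 12,
    }


if __name__ == "__main__":
    for n, m in ((1, 1), (8, 4), (16, 4), (16, 16)):
        print(n, m, array_metrics(n, m))
```

With n = m = 1 the expressions give a single element with no
interconnections, which is at least consistent with the assumed exponents.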

6.1.1.1 Observations

Reports by Mouftah [93] and Aytac [94], on which much of the PE is
based, indicate that the addition logic may be too slow for many applications.
The speed is not known absolutely, however, because the logic gates presented
in the reports are based on use of 5 µm line widths. Implementation of the
array multiplier would be in 2 or 1 micron technology, which could result in a
significant speed increase.

Another  possible  problem  is  the  transmission  of  two  control  signals, 
carry  and  neg,  through  the  array.  Though  these  signals  should  be  local,  they 
are not synchronized with the clock, and may not be nearest-neighbor
transmittable.  Further  research  into  this  problem  will  probably  yield  a 
satisfactory  local-communication  solution. 

Future work in this area could include a comparison between this network
and alternative matrix multipliers, including pipelined parallel or digit
online multipliers, binary and higher-radix implementations, and different
array configurations. In addition, the PE cell should be simulated using a
MOS simulation program, and characterized in terms of speed and area. Using
these data, or making logical assumptions about them, the speed and area of
the entire network can be calculated from the formulas presented earlier. In
addition, the speed and area of alternative PE's should be investigated and
compared.

6.1.2  Digit  Online  Vector  Multiplier  Using  SBNR  Adder  Tree 

In  pipelined  signal  processing  systems,  the  maximum  rate  of  data  flow 
through the pipe is determined by the slowest element. Traditional pipelined
systems  consist  of  a  few  slow  elements,  connected  by  parallel  data  paths.  In 
digit  online  systems  [91],  a  redundant  number  system  is  used  to  allow  data  to 
flow  as  a  stream  of  bits,  with  the  most-significant-bit  leading  the  stream. 

Irwin and Owens have identified many advantages to this mode of
operation. The first is that the bit-stream approach allows the system to
perform bit-level operations on the data. Since the slowest of these
operations is much faster than the slowest word parallel computation, a much
faster clock rate may be used, possibly increasing data throughput. The
second advantage of digit online architectures is that result bits can begin
streaming out of a processor after only a small online delay from the start of
the input data. The result can then be used in the next processor. This
effect allows several links in the processing chain to operate simultaneously
on results generated from a single data word. Thus, the effective throughput
of an element is determined more by its online delay (latency) than the total
time of computation. The third advantage of digit online systems is that of
chip pinout. Since the data are transmitted in bit-serial mode, the number of
pins on a chip does not depend on the length of the data words.

Space Tech has investigated a digit online multiplier that computes the
fixed-point inner product of two vectors. The vector elements arrive
simultaneously on separate data paths in bit-serial format. The multiplier
can accept either the most- or least-significant-bit-first with no change in
calculation time. The answer bits appear in the same order as the input. The
multiplier uses Signed Binary Number Representation (SBNR) to allow fully
parallel addition internally and to make it compatible with digit online
systems.
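
The fully parallel addition that SBNR permits can be sketched as follows.
This is a standard carry-free signed-digit addition rule in which each sum
digit depends only on a fixed number of neighboring positions; it is offered
as an illustration and is not claimed to be the circuit used in the
multiplier. Function names are hypothetical, and digits are listed MSB first.

```python
def sbnr_value(digits):
    """Value of a signed-binary number, MSB first, digits in {-1, 0, 1}."""
    value = 0
    for d in digits:
        value = 2 * value + d
    return value


def sbnr_add(x, y):
    """Add two equal-length SBNR numbers without a carry chain.

    Returns len(x) + 1 digits, MSB first, each in {-1, 0, 1}.
    """
    n = len(x)
    # Position sums; index 0 is the least-significant position.
    p = [a + b for a, b in zip(reversed(x), reversed(y))]
    transfer = [0] * (n + 1)  # transfer[i] is the digit passed into position i
    interim = [0] * n
    for i in range(n):
        # Only the next-lower position is inspected, so there is no ripple.
        lower_nonneg = i == 0 or p[i - 1] >= 0
        if p[i] == 2:
            transfer[i + 1], interim[i] = 1, 0
        elif p[i] == 1:
            transfer[i + 1], interim[i] = (1, -1) if lower_nonneg else (0, 1)
        elif p[i] == 0:
            transfer[i + 1], interim[i] = 0, 0
        elif p[i] == -1:
            transfer[i + 1], interim[i] = (0, -1) if lower_nonneg else (-1, 1)
        else:  # p[i] == -2
            transfer[i + 1], interim[i] = -1, 0
    total = [interim[i] + transfer[i] for i in range(n)] + [transfer[n]]
    return list(reversed(total))


if __name__ == "__main__":
    a, b = [1, 0, -1], [0, 1, 1]  # both represent 3
    s = sbnr_add(a, b)
    print(s, sbnr_value(s))
```

The loop is written sequentially for clarity, but every iteration reads only
p[i] and p[i - 1], so all positions could be evaluated at the same time in
hardware.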

Inner  Products 

In many digital signal processing applications, the primary function of
the processor is finding the inner product of two vectors: a · b = c.

For two vectors a and b, each composed of m elements, a_i and b_i, the inner
product is defined as:

c = \sum_{i=1}^{m} a_i b_i (25)

When each element is defined by an n-bit signed binary word, the vector
multiplication can be decomposed into a matrix inner product on the bits of
the vectors. Thus, if each element a_i is represented as

a_i = \sum_{k=0}^{n-1} 2^k a_{i,k} (26)

then the inner product of the vectors can be rewritten as

c = \sum_{i=1}^{m} \sum_{k=0}^{n-1} \sum_{j=0}^{n-1} 2^{j+k} a_{i,k} b_{i,j} (27)

It is this function that the vector multiplier implements.
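
The equivalence of the word-level and bit-level forms can be checked
directly. In this sketch each word is a list of signed binary digits in
{-1, 0, 1}, indexed so that digit k has weight 2^k, as in (26). The function
names and the sample vectors are illustrative assumptions.

```python
def word_value(bits):
    """Equation (26): value of a word whose digit k has weight 2**k."""
    return sum(d * 2 ** k for k, d in enumerate(bits))


def inner_product_words(a, b):
    """Equation (25): sum of word-level products."""
    return sum(word_value(ai) * word_value(bi) for ai, bi in zip(a, b))


def inner_product_bits(a, b):
    """Equation (27): triple sum over elements and digit positions."""
    total = 0
    for ai, bi in zip(a, b):
        for k, a_ik in enumerate(ai):
            for j, b_ij in enumerate(bi):
                total += 2 ** (j + k) * a_ik * b_ij
    return total


if __name__ == "__main__":
    a = [[1, 0, -1], [0, 1, 1]]   # word values -3 and 6
    b = [[1, 1, 0], [-1, 0, 1]]   # word values 3 and 3
    print(inner_product_words(a, b), inner_product_bits(a, b))
```

Each term of the triple sum is a product of two single digits, which is the
operation a bit-level partial product cell performs.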

6.1.2.1 Vector Multiplier Structure

Figure 10 shows the architecture of the vector multiplier. The bits of
vector A appear on the m-wide bus at the top. When the first bit appears, the
MSB line is brought high, thus latching the MSB's of A into the first partial
product cell. At the same time, the first bits of B appear at the cell (AB_1),
and are multiplied by the corresponding bits of A. The results are added, and
the sum appears at the bottom of AB_1. The architecture for this operation is
similar to the Takagi multiplier [95].

In the next clock cycle, the MSB signal is latched to the right, thus
latching the bits on the A bus into the second cell. The MSB's of vector B
also latch into the second cell, and the next bits of B appear at AB_1. Thus,
AB_2 contains A_{n-2} and B_{n-1}; AB_1 contains A_{n-1} and B_{n-2}. These
partial products are then calculated. The product from the first cycle is
latched into the top of the adder tree.


On the third cycle, the two second-level partial products from AB_1 and
AB_2 are added together in the m-wide parallel adder. The B bits are
right-shifted, and the third bits of A are latched into the third cell. The
first product moves down to the next-level adder.

Figure 10. Digit Online Vector Multiplier

Similarly, each subsequent clock cycle generates the next-lower bit
position of the product. Each of these cascades through the adder tree until
it hits the data skew block.

The function of this element is to shift the incoming words to their
proper bit position in the result. The skewed word is then added to the
cumulative sum to produce the final sum. The result is shifted out of the
bit-select latch.


6.1.2.2 Area and Time Complexities


The area and time factors of this multiplier are excellent for
single-chip implementation. The area complexity is O(mn + n^2), which is
better than that of the bit-level systolic array [96]. The time for
calculation is 2n clock cycles, but the latency is just n + log_2 n, and is
not dependent on m. For the case of m = 16 and n = 16, this multiplier takes
76,600 transistors, which is half the number needed for 16 paralleled Takagi
multipliers. If the latency were reduced to something less than n, the next
stage of the pipeline could operate on the inner product as it was being
calculated. This reduction may or may not be possible, however.


In addition, the architecture seems too hardware intensive for a
pipelined system. Further research should be done to try to use recursive
properties to reduce or eliminate the full-parallel elements and/or the adder
tree. The next step is investigation into alternatives, specifically, a
structure like that proposed by Rhyne and Strader [97]. Any alternatives
found should be characterized with respect to this and other architectures.
If none are found to be better, more comparisons should be made between this
multiplier and the alternatives.


6.1.3  Systolic  LMS  Architecture 


Recursive  least-mean-square  algorithms  have  wide  application  in  many 
types  of  estimation  problems.  One  such  application  is  adaptive  beamforming. 
Beamforming  is  commonly  used  in  radar  and  sonar  applications,  both  in 
transmitting  directed  wavefronts  and  receiving  from  selected  directions. 


A  trade-off  exists  between  the  speed  of  adaptation  and  the  stability  of 
the  formed  beam  [98].  In  general,  the  more  recursions  needed  for  adaptation, 
the  more  stable  is  the  steady-state  performance.  However,  an  increase  in 
system throughput will speed up the adaptation without affecting the
steady-state stability. The architecture proposed by Space Tech provides the
increased throughput needed for high bandwidth communications.


6.1.3.1 Recursive LMS Algorithm


Widrow's LMS algorithm consists of an adaptively weighted input stage
and a weight update stage. We modified the algorithm to allow pipelining
[33], but the architecture included two systolic array processors. An
alternative design uses a pipelinable algorithm, but only a single systolic
array. To see how the algorithm works, define the following variables:


    k = system word length in bits
    n = number of receiving antennas
    S = k x n bit matrix of input samples
    W = k x n bit matrix of weighting coefficients
    y = bit vector of filter output
    d = bit vector of input reference signal
    e = d - y = bit vector of filter error
    μ = convergence rate factor

(The value of a bit vector is just the vector multiplied by powers of two:

$$ [\,2^{k-1} \;\; 2^{k-2} \;\; \cdots \;\; 2 \;\; 1\,]\; y \;) $$


The output of the filter is given by

$$ y_j = S_j^T W_j $$

The filter weights, W, are determined by iteratively comparing the filter
output to the training signal, d. This difference, when multiplied by the
input matrix, gives the line of steepest descent toward convergence [98]. The
equation to calculate the new filter weights is

$$ W_j = W_{j-1} + 2\mu e_{j-1} S_{j-1} $$


Left multiplying this equation by the current input gives

$$ S_j^T W_j = S_j^T W_{j-1} + 2\mu e_{j-1} S_j^T S_{j-1} $$

If we assume that the inputs are uncorrelated (i.e.,
$E[S_j S_{j+1}^T] = 0$), then we can make the following approximation:

$$ S_j^T W_j \approx S_{j-1}^T W_{j-1} + 2\mu e_{j-1} S_j^T S_j $$


Andrews implemented a similar function using two systolic arrays, one to
calculate $y_j$ for $S_j$, and one to update $W_{j+1}$. The architecture below
uses a single systolic array, combined with a small number of latches, serial
adders, and serial multipliers. Also, since the weights have been removed in
Equation 30, no weight-update phase is required.
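
The recursion above can be checked numerically. The following is a minimal
sketch, assuming real-valued word-level samples rather than the bit matrices
of the text; the function names and the two-antenna data are hypothetical. It
performs one conventional Widrow LMS step and then confirms that
left-multiplying the weight update by the current input gives the same number
on both sides of the identity.

```python
def dot(a, b):
    """Inner product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))


def lms_step(w, s, d, mu):
    """One Widrow LMS recursion; returns (y, e, w_next)."""
    y = dot(s, w)                      # filter output, y = S^T W
    e = d - y                          # error against the training signal d
    w_next = [wi + 2 * mu * e * si for wi, si in zip(w, s)]
    return y, e, w_next


# Hypothetical two-antenna example (values chosen to be exact in binary).
w_prev = [0.5, -0.25]
s_prev = [1.0, 2.0]
s_curr = [0.5, 1.5]
mu = 0.125

y_prev, e_prev, w_curr = lms_step(w_prev, s_prev, d=1.0, mu=mu)

# Left-multiplied form: S_j^T W_j = S_j^T W_{j-1} + 2 mu e_{j-1} S_j^T S_{j-1}
lhs = dot(s_curr, w_curr)
rhs = dot(s_curr, w_prev) + 2 * mu * e_prev * dot(s_curr, s_prev)
print(lhs, rhs)  # 0.75 0.75
```

The sketch only exercises the exact left-multiplied identity; the subsequent
approximation that removes the weights is not modelled here.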


6.1.3.2 Architecture Details


The arrangement shown in Figure 11 implements the LMS algorithm of (30).
The systolic array multiplies the input by itself, $S^T S$. The output from the
array is on K parallel lines, all of equal significance. Thus, at any clock
cycle, each line carries a value in the same bit position as all the other
lines.

These wires are fed into a synchronous, bit-serial adder tree that
accumulates all of the partial products. The result of this accumulation is
the inner product $S^T S$, which is fed into a bit-serial multiplier at the
same time as the product μe. The x2 factor results from bringing $S^T S$ into
the multiplier one clock cycle ahead of μe, thus performing a left shift on
the product. The resulting value is the steepest-descent gradient of the error
surface. The new output is formed by adding the error value to the old
output.

If all arithmetic is performed with length-K operands and results, the
latency of the architecture is

$$ \text{latency} = N + \log_2 K + 1 \qquad (31) $$

and a new sample may be entered every K clock cycles. The clock period is
determined by the speed of the serial multiplier. If Signed Binary Number
Representation is used, all arithmetic inside the multiplier may be done in
parallel, with a total delay of 35t, regardless of word length (t = delay of
one transistor). Total device count for the circuit is:

    K x N          Multiply-accumulators
    K + 1          Serial adders
    2              Serial multipliers
    K x (N + 2)    Latches

Thus, the device-latency product is $O(N^2 K + NK \log_2 K)$.
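
To make the device-latency figure concrete, the sketch below tabulates the
counts listed above for a given word length K and number of antennas N. It
assumes the latency is read as N + log2 K + 1 (the reading consistent with the
stated device-latency order) and that K is a power of two; the function name
is hypothetical.

```python
def lms_array_cost(K, N):
    """Return (latency_cycles, device_counts) for the single-array LMS design."""
    levels = K.bit_length() - 1        # log2(K) for a power-of-two K
    assert 2 ** levels == K, "sketch assumes K is a power of two"
    latency = N + levels + 1
    devices = {
        "multiply_accumulators": K * N,
        "serial_adders": K + 1,
        "serial_multipliers": 2,
        "latches": K * (N + 2),
    }
    return latency, devices


latency, devices = lms_array_cost(K=16, N=8)
total_devices = sum(devices.values())
print(latency, total_devices, latency * total_devices)  # 13 307 3991
```

The dominant terms are the K x N multiply-accumulators and the K x (N + 2)
latches, which is where the N^2 K growth of the product comes from.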


6.1.3.3 An Adaptive Beamformer Application


Digital adaptive beamforming is commonly applied in communications.
Applications range from voice communication over VHF/HF bands in the tens of
kHz up to secure spread-spectrum data links with RF bandwidths in the 10 MHz
area [99]. The latter case places strict requirements on the throughput of
the processor. To allow sampling at the Nyquist rate, each link in the pipe
must have a bit delay no greater than 50/K nsec. For a 16-bit word length,
each Processing Element must accept a new bit every 3 nsec, corresponding to a
clock rate of 333 MHz.
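
The arithmetic behind these figures can be sketched as follows, assuming a
10 MHz bandwidth sampled at twice that rate and one bit per clock on each
serial link. The function name is hypothetical.

```python
def required_bit_period_ns(rf_bandwidth_hz, word_length_bits):
    """Longest allowable bit period for bit-serial links at the Nyquist rate."""
    nyquist_rate_hz = 2 * rf_bandwidth_hz        # samples per second
    sample_period_ns = 1e9 / nyquist_rate_hz     # 50 ns for a 10 MHz bandwidth
    return sample_period_ns / word_length_bits   # 50/K ns per bit


period_ns = required_bit_period_ns(10e6, 16)
print(period_ns)         # 3.125
print(1e3 / period_ns)   # 320.0  (clock rate in MHz)
```

Rounding the bit period down to 3 nsec gives the 333 MHz clock rate quoted in
the text.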


This  rate  would  be  extremely  difficult  to  sustain  if  the  circuit  were 
spread  over  a  large  board.  Fortunately,  the  proposed  architecture  is 
primarily  a  systolic  array  with  an  adder  tree,  so  the  interconnections  are 
regular  and  nearest-neighbor.  Thus,  most,  if  not  all,  of  the  circuit  may  be 
implemented on a single VLSI chip.

6.1.4 Time and Area Calculations for SBNR Array Multiplier

This section describes Space Tech's studies of a systolic array
architecture which uses MVL and an SBNR to multiply two vectors. The
architecture is compatible with digit online pipelined networks, in which
data are transmitted serially and most-significant-bit-first. The array
structure makes chip layout easy, and the local communication paths reduce
interconnection area. Therefore, this systolic array should be easy to
implement in VLSI.

6.1.4.1 Array Structure


The fundamental operation performed by the processor is the vector
multiplication

$$ C = [\,A_1 \; A_2 \; \cdots \; A_m\,] \begin{bmatrix} B_1 \\ B_2 \\ \vdots \\ B_m \end{bmatrix} = A_1 B_1 + A_2 B_2 + \cdots + A_m B_m $$


To maximize speed, one can use a systolic array of PE's in which many
bit-level calculations occur simultaneously. Such a structure is shown in
Figure 12. In this case, the vectors are four words long, and the words are
three bits long.

Vectors a and b are shifted in from the bottom, most-significant-bit-first.
The result c comes out the top, as shown. All carries from additions
are transmitted asynchronously to the upper right of each cell. The PE to the
upper right always holds one portion of the next-higher bit. The "horn" that
extends up and to the right from the central multiplying core is present to
add up the carries from lower-order bits. (Since the array uses SDNR
arithmetic, a cell's carry depends only on its operands and the neg control
signal from the lower cell, and not on the carry coming in to the cell. This
limits carry propagation to a single cell, and gives signed binary a
significant speed advantage over conventional binary.) As the answer is
shifted out of the top, an adder tree adds up the bits that occupy each
position of significance. The final result is then shifted out of the adder
tree.
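
The limited carry propagation described above can be illustrated in software.
The sketch below uses the standard textbook rule for radix-2 signed-digit
addition (digits in {-1, 0, 1}); it is not taken from the report's cell logic,
and the names sd_add, sd_value and lower_neg are hypothetical. The flag
lower_neg is assumed to play the role of the neg control signal: it tells a
position whether the next-lower position holds a negative operand digit.

```python
from itertools import product


def sd_value(digits):
    """Integer value of a signed-digit word, least-significant digit first."""
    return sum(d * (1 << i) for i, d in enumerate(digits))


def sd_add(x, y):
    """Add two equal-length radix-2 signed-digit words (digits in {-1, 0, 1})."""
    n = len(x)
    interim = [0] * (n + 1)    # interim sum digit produced at each position
    transfer = [0] * (n + 1)   # transfer[i] is the carry entering position i
    for i in range(n):
        p = x[i] + y[i]
        # Does the next-lower position hold a negative operand digit?
        lower_neg = i > 0 and (x[i - 1] < 0 or y[i - 1] < 0)
        if p == 2:
            t, w = 1, 0
        elif p == 1:
            t, w = (0, 1) if lower_neg else (1, -1)
        elif p == 0:
            t, w = 0, 0
        elif p == -1:
            t, w = (-1, 1) if lower_neg else (0, -1)
        else:  # p == -2
            t, w = -1, 0
        interim[i] = w
        transfer[i + 1] = t
    # Result digit i depends only on operand digits at positions i, i-1 and
    # i-2, so no carry ripples along the word.
    return [interim[i] + transfer[i] for i in range(n + 1)]


x = [1, -1, 1]   # 1 - 2 + 4 = 3, least-significant digit first
y = [1, 1, 0]    # 1 + 2     = 3
print(sd_value(sd_add(x, y)))  # 6
```

Because every output digit is computed from a fixed, small neighborhood of
input digits, the addition time does not grow with word length, which is the
speed advantage the text attributes to signed binary.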


The method of operation of the array is shown in Figure 13. On the
first clock cycle (Figure 13a), the MSB's of A1 and B1 are shifted into the
bottom cell and multiplied together. The result is added to the C value held
there (initially zero), and the carry travels up and to the right, where it is
added to the next C (also zero). A control signal, neg, is also propagated to
that cell to be used in the addition. At the same time, all the other bits in
the first row are shifted into their first cells and multiplied by zero.


On the second clock cycle (shown in Figure 13b), the new C values are
shifted up, one is added to the next A x B product, and that carry is added to
the next C as before. Any carry generated from the term (C + 0) propagates up
and to the right into the next C. In addition, the terms

$$ A_{1,1} \times B_{1,2} \quad \text{and} \quad A_{1,2} \times B_{1,1} \qquad (34) $$

are generated just below, and on either side of, C. The shift registers at
the top of the array ensure that all bits of a given significance arrive
simultaneously.

The equations below give the number of different cells required to
construct a systolic array that will multiply two m-word vectors, where each
word is n bits long and is represented in SBNR. The number of
multiply-accumulate PE's (hexagons in Figure 12) is given by

$$ PE = n^2 + (2n-1)(m-1); \qquad (35) $$

the number of adders (diamonds) is

$$ Add = \tfrac{1}{2}\,[\,n(n-1) + (n+m-1)(n+m-2) - 1\,]; \qquad (36) $$

and the number of shift registers (squares) is

$$ Reg = (3n+m-4)^2 + (n+m-2). \qquad (37) $$

The total multiplication time for the array is

$$ T = (2n+m)\,t, \quad \text{where } t = \text{clock period}. \qquad (38) $$


In pipelined systems, another important measure is the latency: the time
from the start of the incoming data to the start of the outgoing. For the
array this number is

$$ (n+m)\,t. $$

Thus, the MSB's of the vectors are clocked in, and (n+m) cycles later, the
answer MSB is clocked out. Every cycle after that, two more bits can be made
available, or they can be buffered and streamed out. The latter method gives
a total multiply time of (3n+m) cycles.


6.1.4.2 Processing Element Circuit

The three types of cells needed, their functions, and their I/O paths
are shown in Figure 14. Figure 14a is the Ternary Multiply-Accumulate Cell
(TMAC), which forms the core of the array. Part b is a PE which does not
multiply any numbers; it simply adds the previous sum to zero and generates a
new sum and carry. The shift register in Figure 14c is used to align carries
with new rows, or to align the answer bits at the output.

The TMAC in Figure 14a multiplies the incoming A and B bits, and adds
them to the incoming sum C. The output signals, carry and neg, are
generated from AB and C for use in the PE to the upper right. The next C is a
function of AB, C and the carry and neg signals from the lower-left PE. On
the next clock cycle, the sum and two multiplicands are shifted to the next
higher cells.

The ternary adder cell in Figure 14b adds the incoming C to zero, since
there is no AB term. In SDNR, unlike binary, this addition can generate a
carry. So the adder cell operation is the same as that of the TMAC, except
that no multiplication is necessary. As a secondary function, if the adder PE
is in the path of the B coefficients, it acts as a shift register for them.

The ternary latch cells (Figure 14c) act as one-cycle delays for
whatever they are latching. Sometimes a latch cell will transmit a C value
vertically. Sometimes it will take a carry from one column, and turn it into
a sum, C, in the next column. The latches that are in the path of the B
coefficients will also shift them.

Figure 15 summarizes the ternary CMOS logic gates developed by Mouftah
[93], Huertas [100], and Balla [101] that were used in the design of the PE's.
Also shown is an S-gate (switch gate) which is a pair of transmission gates
that pass one of two inputs based on the control signal.

The circuit used to implement the TMAC, and its truth tables, are shown
in Figure 16. The inputs are latched into the flip-flops at the bottom. From
there, A and B are multiplied and the neg signal is generated. Using A, B
and C, the TMAC generates the carry, and with the carry coming up from the
next lower cell, it generates the sum. The total time, from the rising edge
of the clock to the stable sum, is no longer than 17 transistor delays. This
amounts to about 17 ns in 5 micron technology. During the low phase of the
clock cycle, A and B are latched into the outputs. The sum is not latched
because it will stay stable for much longer than the latches of the next cell
require to shift it. The TMAC uses 212 transistors.

The circuit for the adder (not shown) is similar to the TMAC, except
that the circuitry which multiplies A and B is gone, and all inputs with A, B,
or AB are grounded. Some of the combinational logic is also simplified. The
adder uses 98 transistors.

The latch circuit consists only of two flip-flops. One latches a value
in on the positive clock level, and the other latches it out when the clock is
negative. The latch circuit requires 32 transistors.


Figure  14b.  Ternary  Adder  Cell. 


Figure  14c.  Ternary  Shift  Cell 


One can calculate the number of transistors needed to implement a
specific array using the equations of Section 6.1.4.1. The total number is
given by

$$ \text{transistors} = (PE)(\text{trans. per PE}) + (Add)(\text{trans. per Add}) + (Reg)(\text{trans. per Reg}) \qquad (40) $$
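
Equation (40) is straightforward to evaluate. The sketch below uses the
per-cell transistor counts quoted above (212 for the TMAC, 98 for the adder,
32 for the latch) and the PE count of Section 6.1.4.1. The adder and register
counts are passed in as plain arguments rather than recomputed, and the
function names are hypothetical.

```python
TRANSISTORS_PER_PE = 212    # TMAC
TRANSISTORS_PER_ADD = 98    # ternary adder cell
TRANSISTORS_PER_REG = 32    # ternary latch cell


def pe_count(n, m):
    """Multiply-accumulate cells for m-word vectors of n-bit words."""
    return n * n + (2 * n - 1) * (m - 1)


def array_transistors(pe, add, reg):
    """Total transistor count for given numbers of each cell type."""
    return (pe * TRANSISTORS_PER_PE
            + add * TRANSISTORS_PER_ADD
            + reg * TRANSISTORS_PER_REG)


# Multiplying core only, for four-word vectors of three-bit words.
core_cells = pe_count(n=3, m=4)
print(core_cells)                            # 24
print(array_transistors(core_cells, 0, 0))   # 5088
```

The example counts the multiplying core alone; the adders and shift registers
of a complete array add to this figure.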

Table 3 gives the number of CMOS transistors in the array for different vector
sizes and word lengths. Those readers who are familiar with current VLSI
capability will recognize that, despite a highly regular structure and low
interconnection area, most of the numbers given are larger than can be
implemented on a single chip. Even among those that are small enough, though,
the chip yield will be rather low.

Unfortunately, this extra space does not buy a faster processor. The
time for a 16-bit multiply of length-16 vectors is 48 clock cycles. That
amounts to about 816 ns. The time for the same multiply implemented with 16
paralleled 16-bit multipliers [95] is 120 ns. That multiplier setup requires
only 350,000 transistors, a size increase of 10 percent over the array. Ten
percent is a small price in area for a seven-fold speedup.
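
The timing comparison can be reproduced from numbers given in the text: a
multiply time of 2n + m clock cycles, a clock period of about 17 ns (the TMAC
settling time), and 120 ns for the paralleled multipliers. The sketch below is
only a worked check of that arithmetic; the names are hypothetical.

```python
CLOCK_PERIOD_NS = 17            # TMAC settling time quoted for 5 micron CMOS
PARALLEL_MULTIPLY_NS = 120      # 16 paralleled 16-bit multipliers


def array_multiply_cycles(n, m):
    """Clock cycles for the systolic array to multiply two m-word vectors."""
    return 2 * n + m


cycles = array_multiply_cycles(n=16, m=16)
array_ns = cycles * CLOCK_PERIOD_NS
print(cycles)                             # 48
print(array_ns)                           # 816
print(array_ns / PARALLEL_MULTIPLY_NS)    # 6.8
```

The ratio of 6.8 is the "seven-fold" speedup of the parallel multipliers over
the array.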


To compare area and time trade-offs, one uses the area-time product.
The second column from the right in Table 3 contains the area-time products
for the systolic array. The last column represents the comparative numbers
from the Takagi multiplier. Note that at the closest, these numbers are
separated by a factor of 1.5, and that that distance increases with vector
length.

This relationship is easily seen from the complexities of the two
architectures. The area of the systolic array is $O(n^2 + m^2)$, while the
area of the parallel multiplier is $O(m n^2 \log_2 n)$. From this, one can see
that the area of the multiplier increases faster with n than the area of the
systolic array. Unfortunately, the array starts out so much bigger, and the
multiplier cannot easily catch up. The time complexity of the array is
$O(n+m)$. Since the PE's in the parallel multiplier operate simultaneously,
the multiply time is independent of m, and further, the multiply time is not
strongly dependent on n, being only $O(\log n)$.

The multiplier actually has a higher-order area-time complexity,
$O(n^2 m \log_2^2 n)$, than the systolic array, $O(n^3 + m^3)$. However, the
basic cell in the systolic array is more complex than its counterpart in the
multiplier. Since the array is clocked, each cell must contain a latch for
each data bit going in. In addition, many of the ternary logic gates in the
array use more transistors than the binary gates in the multiplier, which adds
even more area. The result of these factors is that the area-time product
remains lower for the multiplier than for the array as long as the word length
is less than 97 bits.

Figure 17 compares the area-time products for the systolic array and the
parallel multiplier for n ≤ 128 and m ≤ n. The objective of a lower area-time
product for the systolic array is achieved, but only over a limited range.
The curves show that when m = 0.5n, and n > 97, the area-time product of the
systolic array is lower than that of the parallel multiplier. One can also
see how the stronger dependence on n is causing a sharper upward slope in the
multiplier curves, relative to those of the array. In fact, for n ≥ 128, the
area-time product of the array is lower than that of the multiplier for values
of m as high as 0.75n. The other two curves will also catch up if n is made
larger.


Unfortunately,  word  lengths  of  100  bits  or  more  are  very  rare.  This 
fact  limits  the  applicability  of  the  array  architecture  described  to  extremely 
high  precision  operations.  To  make  the  systolic  array  competitive  for  shorter 
words,  its  hardware  must  be  simplified.  A  two-fold  reduction  in  the  number  of 
transistors  in  each  cell  would  halve  the  area-time  product  at  every  point, 
placing  the  systolic  array  significantly  lower  than  the  multiplier  on  the 
area-time  graph.  This  reduction  would  allow  the  systolic  array  to  outperform 
the  parallel  multiplier  on  word  lengths  as  low  as  26  bits.  For  maximum 
benefit, the vector length should be kept to around half the word length.

6.1.4.4 Recommendations For Future Research

Optimization of the PE circuit is essential for implementation as a
vector multiplier. As was shown in Section 6.1.4.5, a two-fold reduction in
PE hardware would make the systolic array useful for much smaller word lengths
(such as 52 bits). For that reason, future research should concentrate on
characterizing various cell configurations such as signed binary logic, ECL,
I²L and CCD gates. The design in another technology might yield significant
savings in hardware.

The results suggest that this architecture might be better suited to
word-level multiplication of matrices, using constant precision operands and
results. Such a setup would eliminate the extensive hardware otherwise needed
to handle carries out of the multiplication core. Data transmission between
the cells could be pipelined or parallel. If it is deemed necessary to obtain
a higher-precision product, either more time could be allowed for the serial
transmission of the extra bits, or additional lines could be added between
cells for parallel communication.

To augment the streamlining of the systolic array, research should
continue in alternative architectures, both pipeline and parallel. The
objective is a vector multiplier that can operate on long vectors (>32 words)
composed of short words (8-16 bits). The best multiplier would be able to
handle variable-length operands without any serious slowing. Included in this
investigation should be other forms of systolic arrays, and combinations of
parallel and pipeline structures.

6.2 Fault-Tolerant Architectures


6.2.1  Residue  Number  Systems  (RNS) 

A major competitive number system offering high reliability and modularity,
and thus capable of fault-tolerance, is the Residue Number System (RNS). As
summarized by Jenkins [102], an RNS is defined by a set of moduli,
$\{m_1, \ldots, m_L\}$, which are pairwise relatively prime integers (i.e., no
pair from the set contains a common non-unity factor). Natural numbers in the
range R = [0, M-1], with $M = m_1 m_2 \cdots m_L$, are encoded by L residue
digits $x_i$, where $x_i = (X) \bmod m_i$, i = 1,...,L, where X is in R.
Residue arithmetic is defined by

$$ (x_1 x_2 \cdots x_L) \;\#\; (y_1 y_2 \cdots y_L) = (z_1 z_2 \cdots z_L), $$

where $z_i = (x_i \;\#\; y_i) \bmod m_i$, and # is one of addition,
subtraction, or multiplication. Note that RNS arithmetic has a natural modular
structure that leads to modularity and parallelism in the hardware.
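
The definition above can be exercised with a short sketch. The moduli
(5, 7, 9, 11) are a hypothetical choice of pairwise relatively prime integers,
giving M = 3465; the function names are likewise illustrative. Decoding uses
the Chinese Remainder Theorem, which is not discussed in the text but is the
usual way to map residue digits back to a natural number. The sketch assumes
Python 3.8 or later for math.prod and the three-argument pow with a negative
exponent.

```python
import operator
from math import prod

MODULI = (5, 7, 9, 11)   # hypothetical moduli, pairwise relatively prime


def to_rns(x, moduli=MODULI):
    """Encode a natural number as its tuple of residue digits."""
    return tuple(x % m for m in moduli)


def rns_op(a, b, op, moduli=MODULI):
    """Apply op digit by digit; no digit ever sees another digit's value."""
    return tuple(op(x, y) % m for x, y, m in zip(a, b, moduli))


def from_rns(residues, moduli=MODULI):
    """Decode residue digits with the Chinese Remainder Theorem."""
    big_m = prod(moduli)
    total = 0
    for r, m in zip(residues, moduli):
        partial = big_m // m
        total += r * partial * pow(partial, -1, m)   # modular inverse
    return total % big_m


a, b = to_rns(123), to_rns(27)
print(from_rns(rns_op(a, b, operator.add)))   # 150
print(from_rns(rns_op(a, b, operator.sub)))   # 96
print(from_rns(rns_op(a, b, operator.mul)))   # 3321
```

Each residue channel in rns_op is computed independently, which is the
software analogue of the modular, parallel hardware structure noted above.
Results are only meaningful while they stay inside the range [0, M-1].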

The  lack  of  communication  among  digits  in  residue  arithmetic  suggests 
that if an error occurs in one digit it cannot be propagated into other digit
positions  during  subsequent  operations  involving  addition,  subtraction,  or 
multiplication.  This  property  provides  a  basis  for  fault-tolerance  that  is 
inherent  in  the  basic  algebraic  structure,  and  which  can  be  used  to  obtain 
fault-tolerant  hardware  architectures.  During  some  of  the  more  difficult  RNS 
operations  such  as  scaling,  division,  or  magnitude  comparison,  there  is 
interaction  between  residue  digits  and  this  error  isolation  property  is  not 
preserved. Therefore, the fault-tolerant properties of RNS arithmetic are
particularly useful for certain types of signal processing applications where
most  of  the  computation  consists  of  addition,  subtraction,  and  multiplication. 
Two-dimensional  digital  filtering  used  in  image  enhancement  and  feature 
extraction  is  an  example  of  a  computation  intensive  operation  that  is  ideally 
suited  for  RNS  techniques. 

The nonweighted structure of the RNS code is another basic property that
makes residue arithmetic useful in the design of fault-tolerant hardware
structures. If a particular residue digit is consistently erroneous, the
corresponding faulty module can be identified by RNS error checking techniques
and disconnected without affecting the other modules. If the original RNS
contains enough dynamic range, the reduced processor can continue functioning
with a reduced dynamic range. This concept is called soft failure because the
processor does not catastrophically fail when a hardware failure occurs, but
rather the faulty module is disabled, and the remaining modules continue
functioning in a useful although restricted manner. If desirable, error
correction can be used to replace the function of the faulty module provided
enough redundancy is designed into the code.
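
Soft failure can be sketched in a few lines. The example below assumes the
same hypothetical moduli (5, 7, 9, 11), simulates a fault in one residue
module, and shows that dropping that module still recovers the value as long
as it fits in the reduced dynamic range. How the faulty module is identified
is outside the scope of the sketch; its index is simply given. Python 3.8 or
later is assumed.

```python
from math import prod


def crt(residues, moduli):
    """Decode residue digits with the Chinese Remainder Theorem."""
    big_m = prod(moduli)
    total = sum(r * (big_m // m) * pow(big_m // m, -1, m)
                for r, m in zip(residues, moduli))
    return total % big_m


def drop_module(residues, moduli, faulty):
    """Disconnect one module and return the reduced residue system."""
    keep = [i for i in range(len(moduli)) if i != faulty]
    return [residues[i] for i in keep], [moduli[i] for i in keep]


moduli = [5, 7, 9, 11]           # full dynamic range: 3465
x = 200
residues = [x % m for m in moduli]
residues[2] = (residues[2] + 1) % moduli[2]   # simulated fault in module 2

print(crt(residues, moduli) == x)             # False: the fault corrupts X

good_residues, good_moduli = drop_module(residues, moduli, faulty=2)
print(prod(good_moduli))                      # 385 (reduced dynamic range)
print(crt(good_residues, good_moduli))        # 200
```

The value 200 survives because it is below the reduced range of 385; a value
above that limit would no longer be representable after the module is
disabled.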

6.2.1.1 Residue Number Implementations

Applications of RNS theory to general purpose computers, as well as the
use of redundant residue digits to provide error detection/correction in RNS
structures, have been researched for a number of years. More recently,
advances in VLSI circuit technology have renewed interest in RNS applications
to Digital Signal Processing (DSP). Although some problems still exist with
magnitude comparison, division, scaling, and related operations, technological
improvements have provided economical ways to work around these shortcomings.
Paul [103] recently examined such implementations and has shown remarkable
fault-tolerant capabilities for the realization of high-performance DSP
systems. He showed how system reliability can be enhanced by a parallel
processor structure partitioning word length among processors (just as we
propose to do for SBNR). His implementations of Redundant Residue Number
Systems (RRNS) enhanced identification of faulty processors and modular system
degradation. As we also propose, he showed the efficacy of short word length
of the residues. A clue to his gracefully degradable design is the
incorporation of pipelined memory accesses optimized for speed. SBNR
realization handily captures the same pipelined enhancements, even more so
since SBNR has elements in {-1,0,1} instead of several prime moduli as found
in RNS. We believe that the uses of redundant residue codes for
fault-tolerance capability already examined by [104-107] carry over directly
to SBNR realizations.

6.2.2 Graceful Degradation [108]

Mapping algorithms onto processor arrays has been widely investigated
to date [67, 109-113]. Some general observations can be made from these
efforts. Even though the dynamic reallocation of data/instructions is
complicated and relatively slow, dynamic array configuration is just as
injurious in the graceful degradation issue. Even an array with complete
reconfigurability is difficult. In our architectural studies proposed herein,
the large array configurations exacerbate these issues. Hence, greater
concern for graceful degradation issues is necessary.

There exist alternative solutions, device redundancy being one of them.
Another is to attempt to map smaller algorithms onto the same configuration
size, assuming that spare processors are freed up for fault-tolerant purposes.

[...] provides a critical assessment. As Fortes notes, when we examine
classic architectures such as the MPP, ILLIAC [115], CHiP, Diogenes arrays,
NCR 45CG72, PC Systolic machines, turbo boards, and hardware accelerators, a
consensus draws the conclusion that, unless a functioning replica of the
original array is up, graceful degradation is impossible. Other solutions to
graceful degradation encompass algorithm rescheduling methods [116] and
classic error detection/correction schemes [52]. Also, [117] reports on a
clever connectivity preservation scheme for VLSI multiprocessor systems.

6.3  Wafer  Scale  Integration 

Packaged integrated device reliability is improved with less pin out
[118]. Input/output pads are susceptible to electrostatic discharge,
especially on MOS circuits. Also, relative I/O pad area in small-scale ICs
is high. Driver area is high, and drivers are power hogs. Finally, I/O pins
are prone to mechanical failure.

Pin estimates for each package are provided by Rent's Rule. Rent's Rule
applies especially to small sub-modules embedded in larger systems. When the
package contains a major subsystem or the entire system, Rent's Rule is
overbiased. Consequently, WSI is particularly attractive for a complete
processor integration.

Integrating these chips on one wafer is space efficient because of
inter-package connections. Shorter lines have less self-capacitance. This
reduces the size and power consumption of line drivers. Since connections
dominate most circuit layouts, WSI can substantially speed systems, reduce
total internal power requirements and improve density. Nevertheless, caution
is advised because the interconnections have not disappeared altogether.

Board-level  system  designers  address  signal  quality,  noise,  and  power- 
distribution  problems.  In  the  dense  layout  found  in  VLSI  or  WSI  they  become 
just  as  acute  at  even  lower  frequencies.  The  increase  in  interconnection 
density  also  produces  interline  coupling  problems  caused  by  mutual 
capacitance.  Interline  coupling  can  arise  between  adjacent  lines  on  a  given 
layer  as  well  as  between  lines  on  overlapping  layers  (most  dangerous  on  long 
runs  of  parallel  lines).  The  decreased  separation  between  lines  in  WSI, 
compared with that in other packaging arrangements, plagues designers. Hence,
long parallel runs are to be avoided in layout. To fabricate a WSI system the
size of a full wafer demands either a repair capability, a tolerance for
failure, or a combination of both. (Consider the HP RISC chip set.)
Tolerance for failure implies redundancy in some form, while repair capability
implies intervention in the fabrication process. Yet, the wafer-yield
enhancement resulting from this redundancy or intervention is still the most
attractive benefit of WSI.

6.4  Reliable  MVL  Systolic  Arrays 

Some very complex algorithms have already been implemented on systolic
arrays by Kung, Leiserson, and Andrews. Only recently have MVL systolic array
implementations been studied. Andrews has established the efficacy of MVL
arrays for the least-mean-square algorithm, which is much simpler than the
proposed applications to be studied herein. However, Moraga [119] has
demonstrated MVL array effectiveness for Chrestenson transform (a Walsh
transform is a subset) computations. He shows how a Chrestenson spectrum of
n-ary, n-place, p-valued functions is configured in an MVL systolic system.
(Note, these are complex valued functions.)

His VLSI PE's behave as an MIMD machine, unlike all other array studies
which are SIMD. This useful study provides us with important clues to develop
our algorithms and some very important preliminary results as discussed next.
Moraga conveniently provides us with an algorithm to generate a complete test
set for detecting stuck-type MVL faults.

In a binary solution, 32-bit ALU's generate 32-bit additions and complex
multiplications, producing 32-bit truncated results. 32-bit data and
intermediate results would have to be stored/transferred between cells (hence,
at least 32 wires are needed for cell interconnections). In an MVL design,
Moraga [119] shows that we can use 1-digit arguments and the most complex
operation is a 1-digit subtraction mod p. Exact results for the spectral
elements are obtained as (p x n)-digit words. Hence, even for a 6-place,
5-valued (n-ary) function, we require less than 32 digits. One-out-of-p coding
and parallel updating of counters is then realized in a shorter time than that
required for the multiplication of two 16 x 16 bit (complex) numbers. Two
important advantages are now evident with MVL in systolic arrays. First, far
fewer interconnections between cells occur. Second, simpler and faster
circuits can be used in the PE. (Note, the 1-out-of-p coding and parallel
update counter scheme is another realization of distributed arithmetic.) We
can incorporate these same promising contributions within our SBNR PE's.
Reliable MVL circuits can then be analyzed using the M-difference calculus
proposed by Lu and Lee [120] for fault-detection of single and multiple
stuck-at faults in MVL, based on earlier work [121,122].

6.5  Ternary  Logic 

Three-valued  or  ternary  logic  may  have  an  edge  on  binary  logic  i_  1 2;_. 
The  information  per  wire  ratio  is  higher;  the  complexity  of  interconnections 
can  be  reduced;  chip  area  reductions  appear  likely;  and  efficient  error- 
detection  and  error-correction  codes  can  be  employed.  Serial  arithmetic 
operations are faster. As such, these advantages encouraged study. References
[123-129] offer several realizations. Given a dynamic range, the ternary
circuit complexity [101] is comparable to that of corresponding binary
circuits. Nevertheless, the associated reduction in the word length tends to
ameliorate the pin-limitation problems.

A new family of ternary logic circuits based on both depletion and
enhancement types of complementary MOS transistors (DECMOS,
Depletion/Enhancement Complementary Metal-Oxide-Semiconductor) has been shown
to be useful in the design of ternary digital systems. With the use of voltage
power supplies below the transistors' threshold voltage and the exclusion of
resistors, it is possible to implement this circuitry in VLSI. This also
offers low power consumption, high speed, and performance comparable to the
binary counterpart circuitry. The circuits use two power supplies, each below
the transistors' threshold voltage, and do not include any resistor. The
circuit design of basic ternary operators (inverters, NAND, NOR) and an
example of the use of these basic ternary operators as building blocks in the
design of a ternary full adder are now available [75]. In [75], the Simple
Ternary Inverter (STI), the Positive Ternary Inverter (PTI) and the Negative
Ternary Inverter (NTI) are three possible ternary operators.

6.6  Taxonomy  of  Fault-Tolerant  Schemes  [78] 

6.6.1  Fault-Tolerant  Nodes 

In this scheme, spare PE's are placed at each array node. The spare
PE's may be arranged in a number of different ways. Figure 18 illustrates the
case where three PE's and a checker are placed at each node.

6.6.2  Temporal  Redundancy 

In many systolic arrays, the PE's are idle for a large percentage of the
time. The basis of the temporal redundancy scheme is to replace a faulty PE
by an idle neighbor, on a cyclic basis [130].

Alternatively, this idle time can be created by setting up each PE with
two separate inputs and forcing the PE to devote half of its cycles to each of
these inputs [131]. Appropriate steering circuits must also be provided, as
illustrated in Figure 19.


Figure 18. Node Redundancy Scheme [78]

Figure 19. Temporal Redundancy Scheme [131]

Figure 20. Row Bypass Scheme [133]

Figure 21. Row-Oriented Schemes

Figure 23. Interstitial


6.6.3 Interconnection Reconfiguration

With interconnection reconfiguration, fault-tolerance is provided by
rerouting inter-PE connections, thereby bypassing faulty cells. The
programmable interconnect required can be provided by a wire joining and
fusing process or by using active switches.

The fuse and join technologies require extra processing steps for their
implementation. However, area overhead is reduced. Furthermore, since no
switches are involved, propagation delays may be smaller. On the other hand,
using active switches to provide the reconfigurable routing does not involve
any special processing steps. It also offers the potential for reliability
enhancement which is absent in other technologies.

6.6.4 Switch Types

Three types of switches can be used in these schemes [132]:

1. Using the PE as a Connecting Element (CE) only, thereby bypassing
its normal function.

2. Multiplexing PE I/O ports so that PE's can communicate with a fixed
choice of near-neighbor PE's (LI = Local Interconnections).

3. Using external switches that can be as flexible as desired (GI =
General Interconnection).

With these three switch types the area overhead, programming difficulty and
reconfiguration flexibility increase from the CE to the GI switch.

6.6.5  Row  Bypass 

In the row bypass scheme [133], a complete row of PE's is bypassed,
using GI switches. An example of such a scheme is illustrated in Figure 20.
This scheme is suitable for situations with high PE yield, such as bit-serial
arrays.

6.6.6 Row Oriented

In the row oriented schemes, columns are organized through the rows,
taking one and only one column element from each and every row. The unused
elements in each row are then bypassed using GI switches. Schemes where
columns are organized through LI switches are described in [134,135,136].
Each of these papers describes simple circuits which enable the
reconfiguration around faulty PE's to be carried out completely internally and
automatically.

Schemes where the columns are organized through GI switches can also be
found, the simplest of which is illustrated in Figure 21. Because GI
switches are used, no simple internal reconfiguration scheme has been found.
The switches must be programmed externally.

6.6.7 2D Perturbation


In this scheme, the PE's are "perturbed" in both directions, but by
only a small number of PE sites. A scheme described in [138] using CE and LI
switches only is illustrated in Figure 22.

6.6.8 Interstitial Redundancy

In an interstitial scheme, as illustrated in Figure 23, spare PE's are
provided between node sites [139]. The spare PE's are switched into the array
as required, using LI type switches. With similar complexity to the row LI
column schemes, this scheme performs slightly better than those when the PE
yields are between 65% and 30%.

6.6.9  Hierarchical  Scheme 

In [140], a GI switched scheme is described. Here PE's are organized
into blocks of 12 of which only 4 are required. If 4 good PE's could not be
organized from the block then the whole column of blocks would be bypassed.
For PE yields between 40% and 60% this scheme performs best out of all the
schemes presented. In fact, Hedlund intended the scheme to be used for an
array where 33% of the PE's were faulty on the average.
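
As a rough illustration of why blocks of 12 PE's with only 4 required remain usable at low yields, the chance that a block supplies enough good PE's can be sketched with a binomial model. The independence of PE failures, the function name `block_survival` and the sample yields are assumptions of this sketch; they are not taken from [140], and the model ignores whether the good PE's can actually be routed together.

```python
from math import comb


def block_survival(pe_yield, block_size=12, needed=4):
    """Probability that at least `needed` of `block_size` PE's are good.

    Assumes (hypothetically) independent PE failures with a common yield.
    """
    return sum(
        comb(block_size, k) * pe_yield**k * (1.0 - pe_yield) ** (block_size - k)
        for k in range(needed, block_size + 1)
    )


if __name__ == "__main__":
    # Sample the yield range quoted in the text for the hierarchical scheme.
    for pe_yield in (0.40, 0.50, 0.60):
        print(f"PE yield {pe_yield:.2f}: block usable with "
              f"probability {block_survival(pe_yield):.3f}")
```

Under these assumptions the block-level probability stays well above the PE yield itself, which is the effect the hierarchical scheme relies on.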

On comparison of the schemes, temporal redundancy appears to be
efficient only in situations where PE idle time already exists. Interstitial
redundancy has application for certain yield classes. For the reconfiguration
schemes, the more complex the switch type, the greater the flexibility
afforded but also the greater the area overhead. In fact, the more complex
and flexible schemes only become useful for lower yields.

A Systolic SBNR Adaptive Signal Processor

MICHAEL  ANDREWS 


Abstract - A new realization for adaptive signal processing units is
proposed which uses a special subset of signed digit number representations
(SDNR's). This signed binary number representation (SBNR) captures
all of the efficiencies of SDNR arithmetic but also makes circuit
realizations less costly. Furthermore, a natural interface between analog
and digital numbers is provided. The serial on-line processing nature of
SBNR utilizes the MSB first. An area/time complexity for VLSI
implementations in comparable systolic array architectures contrasts the
effectiveness of the different primitive VLSI cells and organizations.

I.  Introduction 

HURST [1] has noted that multi-valued logic (MVL)
may show promise in the future for VLSI. At present,
binary systems are facing interconnect problems which
appear to be insurmountable. Silicon areas devoted to
intrachip connections now consume twice the area of active
logic elements on the chip [2]. Array implementations
cause a severe escalation of interconnect area. Likewise,
off-chip connections are generating new and complex thermal
and mechanical problems for the board designer. Such
factors call for denser information content to interconnection
ratios. In this paper, a redundant arithmetic solution is
proposed which, coupled with MVL, relieves some of the
silicon area inefficiencies of conventional binary arithmetic.

We examine one implementation of ternary arithmetic
which, when viewed as redundant numbers, holds promise
for division-sparse signal processing applications. Section
II briefly describes basic signed digit number properties attractive
to signal processing tasks which manipulate sequential data
streams. Section III discloses an efficient TRIT (ternary
digit) realization which serves as the primitive VLSI cell.
The realization utilizes a balanced encoding coupled with
encoded redundancy to improve both logic delay and gate
count.

Section IV identifies the intimate coupling between word
and bit level matrix x matrix multiplication. Section V describes
a systolic implementation of the least mean square
(LMS) algorithm invoking signed binary number representations
(SBNR's), which is easily realized with MVL. The
LMS algorithm is a difficult adaptive signal processing
benchmark because in-place coefficient methods do not
apply. Section VI identifies appropriate ADC and DAC
SBNR realizations. Section VII contrasts comparable
realizations.

Manuscript received March 1, 19*5; revised September *0, 19*5. Thn
UAAG24 *3 C 0025

II. Signed Digit Number Representations (SDNR)

In  the  most  general  sense,  a  redundant  number  system 
allows  both  an  increase  in  the  number  of  positive  and 
negative  digits  as  follows. 

Definition 1:

    W : R \times R \times \cdots \times R \rightarrow Q    (1)

    a_{n-1} \cdots a_0 . a_{-1} a_{-2} \cdots a_{-m} = \sum_{i=-m}^{n-1} a_i d^i    (2)

where the digits

    a_i \in R = \{ -r_1, -r_1 + 1, \ldots, 0, 1, \ldots, r_2 - 1, r_2 \},    r_1, r_2 > 0.    (3)

The representation described by (1), (2), and (3) is called
redundant notation with base d. The basic properties of
general SDNR are identified in Table I. Avizienis [3],
Atkins [4], Tung [5], Ercegovac [6], and Robertson [7] have
shown that SDNR can effectively operate in a general
purpose digital computer for the reasons noted. However,
the general redundant representation does not lead to
efficient implementations unless restrictions are placed
upon the number set.

III. Efficient SDNR Realization

Several implementations based on the SDNR have already
been investigated [8]-[11], all of which sought to
satisfy general data processing requirements of a mainframe
computer. In contrast, signal processing applications are
multiplication/addition intensive. An efficient SDNR
realization is possible if we select the following redundant
signed binary number representation (SBNR):

    X = \sum_{i=1}^{n} x_i 2^{i-1},    x_i \in \{ -1, 0, +1 \}    (4)

with the notation \bar{X} for -X (\bar{1} for -1). The redundant
representation for 1 is 01 or 1\bar{1}, while for -1 it is 0\bar{1} or \bar{1}1.
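
The redundancy of a short SBNR word can be checked by brute force. The sketch below enumerates every two-digit word and groups the words by value; the function names and the most-significant-digit-first ordering are assumptions made here for illustration, not notation from the paper.

```python
from itertools import product


def sbnr_value(digits):
    """Value of a signed-binary word, most significant digit first.

    Each digit is -1, 0 or +1; successive digits carry weights
    ..., 4, 2, 1 (Horner evaluation).
    """
    value = 0
    for d in digits:
        value = 2 * value + d
    return value


def representations(width):
    """Map each representable value to every digit string that encodes it."""
    table = {}
    for digits in product((-1, 0, 1), repeat=width):
        table.setdefault(sbnr_value(digits), []).append(digits)
    return table


if __name__ == "__main__":
    for value, words in sorted(representations(2).items()):
        print(value, words)
```

For two digits the enumeration gives values from -3 to +3, with only +1 and -1 having two encodings each.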

If we assume a two digit SBNR, nine states are possible
covering the range -3 to +3 with all representations
unique except for 1 and -1, which are (01 or 1\bar{1}) and (0\bar{1} or
\bar{1}1), respectively. Of the 27-state inputs for a full adder
truth table (i.e., three states each for the two digits to be
added, x_i and y_i, and the carry out from the previous
column, c_{i-1}), only six distinct cases described in Table II
are necessary if we always represent 1 as 01 and \bar{1} as \bar{1}1.
Furthermore, in four entries (1, 3, 4, and 6) the carry out is
completely determined by x_i and y_i. Hence, an adder need
only consider carry in for cases 2 and 5 in order to generate
carry out. For these two cases, carry out depends on
whether the previous x_{i-1} or y_{i-1} is negative. From these
considerations, the VLSI circuit proposed by Takagi et al.
[9] in Fig. 1 suffices.

TABLE I
SDNR Properties

Parameter                   Property
Zero in SDNR                Unique if [condition illegible in source]
Addition/subtraction        Totally parallel with minimal carry/borrow
                            propagation and operation time independent
                            of word length
Most significant digit      No special treatment
Digits                      Positionally weighted with sign
Negation                    Simple logical complementation of sign bits
Variable length operands    Handled easily by right-to-left methods
Multiplication              Tends to produce rounded results
Overflow detection          Immediately follows production of most
                            significant digit
End around carry            None, hence single digit ALU slices are
                            identical, making VLSI highly regular
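
The six addition cases can be exercised in software. The sketch below is a behavioral model only, not the circuit of Fig. 1: the function names are hypothetical, and the carry and intermediate-sum values chosen for each case are an assumption derived from the constraint 2c_i + s_i = x_i + y_i together with the rule that cases 2 and 5 inspect the sign of the next lower digits.

```python
from itertools import product


def sbnr_value(digits):
    """Value of a signed-binary word, most significant digit first."""
    value = 0
    for d in digits:
        value = 2 * value + d
    return value


def sbnr_add(x, y):
    """Add two signed-binary words (MSB first) without carry propagation."""
    n = max(len(x), len(y))
    # Work least significant digit first, zero-padding to a common length.
    xs = list(reversed(x)) + [0] * (n - len(x))
    ys = list(reversed(y)) + [0] * (n - len(y))
    carry = [0] * n
    interim = [0] * n
    for i in range(n):
        total = xs[i] + ys[i]
        lower_negative = i > 0 and (xs[i - 1] < 0 or ys[i - 1] < 0)
        if total == 2:        # case 1: both digits are 1
            carry[i], interim[i] = 1, 0
        elif total == 1:      # case 2: depends on the next lower position
            carry[i], interim[i] = (0, 1) if lower_negative else (1, -1)
        elif total == 0:      # cases 3 and 4: digits cancel or are both 0
            carry[i], interim[i] = 0, 0
        elif total == -1:     # case 5: depends on the next lower position
            carry[i], interim[i] = (-1, 1) if lower_negative else (0, -1)
        else:                 # case 6: both digits are -1
            carry[i], interim[i] = -1, 0
    # Second step: each result digit absorbs only its neighbor's carry.
    result = [interim[i] + (carry[i - 1] if i > 0 else 0) for i in range(n)]
    result.append(carry[n - 1] if n else 0)
    return list(reversed(result))


def check_all(width):
    """Exhaustively confirm digit range and value for all operand pairs."""
    words = list(product((-1, 0, 1), repeat=width))
    for x, y in product(words, repeat=2):
        z = sbnr_add(x, y)
        if any(d not in (-1, 0, 1) for d in z):
            return False
        if sbnr_value(z) != sbnr_value(x) + sbnr_value(y):
            return False
    return True
```

Because every result digit depends only on digit positions i, i-1 and i-2 of the operands, the addition time in this model does not grow with word length, which is the property the text attributes to SBNR.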

A primitive cell suitable for large VLSI arrays and
especially for adaptive signal processors must have few
interconnections beyond its nearest neighbors and very
simple controls. Fortunately, many signal processing
algorithms can be implemented with bit-serial arithmetic.

Fig. 2. Primitive cell (internal data flow).

TABLE II
Intermediate Addition Sum Classes

Type  Augend, Addend                  Next Lower Position       Carry    Intermed. Sum
      (x_i, y_i)                      (x_{i-1}, y_{i-1})        (c_i)    (s_i)
1     (1, 1)                          -                         1        0
2     (1, 0) or (0, 1)                Both are positive         1        \bar{1}
                                      At least one is negative  0        1
3     (0, 0)                          -                         0        0
4     (1, \bar{1}) or (\bar{1}, 1)    -                         0        0
5     (0, \bar{1}) or (\bar{1}, 0)    Both are positive         0        \bar{1}
                                      At least one is negative  \bar{1}  1
6     (\bar{1}, \bar{1})              -                         \bar{1}  0

Fig. 2 represents the cell conceptually with dashed lines
indicating the "data flow" internal to the cell, where North,
South, East, and West (N, S, E, W) data paths are available.
Intercell connections shall only be to the nearest neighbors.
Furthermore, latches on the north and east are incorporated
in the cell to aid systolic latching of operand bits.
The cell is now utilized in a systolic array where the
dominant operation of n x n matrix multiplication is
invoked.

IV.  Matrix  x  Matrix  Multiplication 

Matrix operations may be either sums of word level
products or sums of bit level products. Furthermore, a
strong relationship exists between word and bit level systolic
arrays [12]. Treated as bit level manipulations, fast
area efficient VLSI arrays are possible [13], [14]. In our
SBNR implementations, a systolic-like bit level approach is
assumed where each processing cell is a multiplier and
gated full adder.

To understand this word/bit dualism, we consider the
implementations at the word level and show how bit level
similarities apply. Multiplication of two n x n matrices,
S = (s_{ik}) and H = (h_{kj}), to form the matrix product
Y = (y_{ij}) becomes

    y_{ij} = \sum_{k=1}^{n} s_{ik} h_{kj},    i, j = 1, 2, \ldots, n.    (5)

Without any loss of generality, Y may be considered as n
independent vectors y_j. The aggregate of n matrix x vector
product evaluations, each of the type in (6), comprises the
matrix Y:

    y_i = \sum_{k=1}^{n} s_{ik} h_k.    (6)

However, each of (6) is also an aggregate of n inner
product evaluations of the type

    y = \sum_{k=1}^{n} s_k h_k.    (7)

Multiplication of two matrices now becomes a series of
unit multiplications of (7) and an accumulation of relevant
product terms. For this reason, systolic arrays use a
multiplier/accumulator PE. Equation (7) can be partitioned
further into a series of bit level sums of products. The
coefficient of each power of two in the result now becomes
a convolution of the coefficients in the two operands. This
important observation allows us to organize the input signal
streams so that operations at the bit level are pipelined
onto our array, as in Fig. 3. The paramount constraint is
that the physical significance position in the array must be
static so that partial products are accumulated correctly.
We do not require the complicated carry/borrow strategies
found in two's complement systems because our SBNR has
a minimal carry/borrow distance.
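
The statement that each power-of-two coefficient of a product is a convolution of the operand coefficients can be checked directly. In this sketch the digit lists are least significant digit first and may hold signed digits; the function names are hypothetical and the code models the arithmetic only, not the array timing of Fig. 3.

```python
def digit_convolution(a, b):
    """Coefficient of each power of two in a*b.

    `a` and `b` are least-significant-digit-first lists of digits
    (for SBNR, each digit is -1, 0 or +1).
    """
    coeffs = [0] * (len(a) + len(b) - 1)
    for i, a_i in enumerate(a):
        for j, b_j in enumerate(b):
            coeffs[i + j] += a_i * b_j   # products of equal weight 2**(i+j)
    return coeffs


def weighted_sum(coeffs):
    """Collapse per-weight coefficients back into an ordinary integer."""
    return sum(c * 2**k for k, c in enumerate(coeffs))


if __name__ == "__main__":
    a = [1, -1, 0, 1]   # 1 - 2 + 8 = 7
    b = [-1, 1, 1]      # -1 + 2 + 4 = 5
    coeffs = digit_convolution(a, b)
    print(coeffs, weighted_sum(coeffs))
```

The raw coefficients can fall outside the digit set, which is why the accumulating cells need room for growth before the sum is returned to digit form.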

Consider the multiplication of two 4 x 4 matrices as in
Fig. 3. This diagram portrays the interaction of the bits of
two sets of words, s_{ik} and h_{kj}, which compute y_{ij}. Each
square is a gated full adder unit of cells in Fig. 2. The data
words in Fig. 3 have been expanded into their respective
individual bits and the ith row is associated with the kth
set of words. Words s_{ik} enter from the right while words
h_{kj} enter from the left in a bit serial manner. Although we
show the least significant bits entering ahead of
the next significant bits, the MSB's can also enter
first in SBNR. Upon forming partial products, the intermediate
results, y_{ij}, are passed vertically downward.

On a larger scale, the partial products y_{ij} are generated
as in Fig. 4 in the shaded areas. Dashed lines adjacent to
the parallelogram edges are guardbands to allow for growth
generated by carry bits. For m-bit operands, m * 1 2 log2.| bits
are necessary. These guardbands are equivalent to spacing input
parallelograms with guardbands filled in with zero bits.

Fig. 4. Partial product generation of matrix x matrix multiplication.

The shaded areas which move down vertically generate
partial products such that successive cells at a given location
in the shaded diamond area accumulate all terms in

    \sum_{k=1}^{n} s_{ik} h_{kj}.    (8)

The full sum of products is formed by the accumulation of
diamonds emerging from the bottom. A pipelined tree of
adder cells connected to the bottom edge generates the full
sum, which can be clocked out least significant bit first or
most significant bit first (SBNR only). The full sum is then
computed every 2m + 1 clock cycles.

It is important to note that the symmetry of diamonds in
Fig. 4 carries over directly into regular VLSI cells with few
intercell connections, resulting in an extremely efficient
VLSI computational array. If SBNR numbers are not used,
carry/borrow logic and intercell data paths would be
complicated by the same level of complexity necessary to
fabricate full carry lookahead adders (where carry propagate
logic grows as a function of wordlength). In an SBNR
implementation, only nearest neighbor cell paths and same
cell replication are required.

Another advantage to SBNR is the absence of special
circuitry and algorithms to handle signed operands. In
two's complement arithmetic, the Baugh-Wooley algorithm
can be used [15]. In this procedure, two's complement
words are treated as positive numbers if 1) a fixed correction
term is added to the result for each word level multiplication,
and 2) all partial products normally with a
negative weighting are complemented. Two's complement
implementations on a systolic array require a negative
weighting flag or a tag on the partial products which must
propagate vertically down through the array. Hence,
another latch and control line is required for each columnar
path. Furthermore, final addition of correction terms
requires an initialization of the accumulators in the adder
trees.

Fig. 5. Adaptive signal processor. (y = filter output (scalar); e = error
signal (scalar); d = training/conditioning signal (scalar).)

V. An Implementation of the LMS Algorithm

An N-sampled LMS adaptive filter as depicted in Fig. 5
captures a signal, S, into a transversal filter whose scalar
output, y, is obtained by convolving S with adapting
coefficients H. An error signal, e, derived from the filter
output and a training signal, d, drives the LMS weight
update algorithm. The transversal filter has a set of N
registers, each of length K bits, which provides storage for
the N x K array of bit values, B, for the signal S. Boldfaced
characters are vectors or matrices. The independent
time variable, t, is omitted, but is implicit to discussions.
Necessary filter scalars and matrices are defined in Tables
III and IV.

The signal vector, S, can be partitioned into the N x K
array B as in (9),

    B = (b_i^j)  such that  b_i^j : 1 \le j \le N,  0 < i \le K    (9)

where a superscript denotes a sampling moment and a
subscript denotes a bit position in a K-bit word. The signal
vector can be expressed as

    S = BX.    (10)

The output of the filter is given by the convolution

    y = S^T H    (11)

where  the  column  vector  H  represents  the  set  of  N  filter 
coefficients.

TABLE III
Adaptive Filter Scalars

h(n) = nth coefficient of an N-point digital adaptive filter.
f(k) = kth partial product used in the output accumulation.
s(n) = K-bit input signal sample present at point n of an N-point digital filter.
y = digital filter output.
d = input training signal to digital adaptive filter.
e = d - y = error sample generated by digital adaptive filter.

TABLE IV
Adaptive Filter Matrices

S^T = (s(1), s(2), ..., s(n), ..., s(N))
H^T = (h(1), h(2), ..., h(n), ..., h(N))
F^T = (f(1), f(2), ..., f(k), ..., f(K))
X^T = (2^{-1}, 2^{-2}, ..., 2^{-K}), i.e., the set of the first K negative
      integer powers of 2.
B = the N x K array of bit values which results when a K-bit
    input signal vector is stored in an N-point digital filter.

Define the column vector F as

    F = B^T H    (12)

and substituting (10) in (11), using the property of matrix
transposition, we have

    y = X^T F    (13)

where F, which takes the place of the filter coefficients, is a set of
partial products. The LMS algorithm updates

    F' = F + 2 \mu e B^T B X    (14)

where \mu is a convergence rate factor. Equations (13) and
(14) form the iterative computational tasks of the filter.
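
The equivalence between the coefficient form (11) and the partial-product form (13), and between the usual LMS weight update and (14), can be verified numerically. The sketch below is an illustration under stated assumptions: exact rational arithmetic, a fixed B within one iteration, hypothetical helper names, and the conventional update H' = H + 2 \mu e S taken as the reference form.

```python
from fractions import Fraction


def transpose(m):
    return [list(row) for row in zip(*m)]


def mat_vec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def lms_step(B, H, d, mu):
    """One LMS iteration computed in coefficient and partial-product form.

    B is the N x K digit array, H the N coefficients, d the training
    sample and mu the convergence rate factor.
    """
    K = len(B[0])
    X = [Fraction(1, 2 ** (i + 1)) for i in range(K)]  # (2^-1, ..., 2^-K)
    S = mat_vec(B, X)                    # (10)  S = B X
    Bt = transpose(B)
    F = mat_vec(Bt, H)                   # (12)  F = B^T H
    y_direct = dot(S, H)                 # (11)  y = S^T H
    y_partial = dot(X, F)                # (13)  y = X^T F
    e = d - y_partial
    # Reference form (assumed): conventional LMS weight update.
    H_new = [h + 2 * mu * e * s for h, s in zip(H, S)]
    # (14): F' = F + 2 mu e B^T B X, using B X = S.
    F_new = [f + 2 * mu * e * g for f, g in zip(F, mat_vec(Bt, S))]
    return y_direct, y_partial, H_new, F_new


if __name__ == "__main__":
    B = [[1, 0, -1], [0, 1, 1]]
    H = [Fraction(1, 2), Fraction(-1, 4)]
    y1, y2, H_new, F_new = lms_step(B, H, Fraction(1), Fraction(1, 8))
    print(y1, y2, F_new == mat_vec(transpose(B), H_new))
```

Updating F directly means the filter never has to reconstruct H, which is what lets both (13) and (14) run as matrix products on the array.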

Cowan et al. [16] have observed that the output filter
formulation of (13), when compared with (11), reveals the
essential elements of the distributed arithmetic architecture
of the LMS algorithm depicted in Fig. 6. The input
(analog-to-digital converter) signals are presented serially to a
set of N cascaded K-bit shift registers. As this serial bit
stream enters the shift registers, the shift register parallel
outputs generate K N-bit address words on the RAM
address bus. Each RAM datum is then right-shifted K bits
and accumulated. The accumulation is complete after K
memory accesses. Finally, an output sample is converted to
an output analog signal. As in our implementation, the
distributed arithmetic architecture uses no hardware multipliers.
Using (14) in a matrix by matrix multiplication
scheme naturally captures the bit-serial word-parallel power
of systolic arrays behaving as SIMD data-flow engines.

An  additional  circuit  reduction  is  possible  when  we 
utilize  the  latches  in  the  primitive  cells  of  Fig.  2  to  store  the 
input  signal,  S.  Now,  external  RAM  is  no  longer  required. 
As  a  result,  the  VLSI  implementation  is  more  compact. 
Furthermore, vector and matrix transposition operations
are easily accomplished by routing signals in the orthogonal
direction, since the primitive cells have NS and EW
bidirectional ports, saving considerable time. A circuit to
implement  the  LMS  algorithm  is  shown  in  Fig.  7. 

This architecture utilizes two n x m cell systolic arrays
and an adder tree. The upper array computes the filter
output y = X^T F while the lower array updates the filter
coefficients. With two systolic arrays, as configured in Fig.
7, filter output and weight update can be pipelined so that
the total computational delay from signal input sample, S,
to output signal sample, y, is no greater than one-bit
conversion of the ADC. An expensive ADC flash conversion
is not necessary.

Fig. 6. Distributed arithmetic architecture.

VI.  ADC  and  DAC  Methods 

It is easily seen that a balanced redundant encoded
number, A, can be represented by a "positive" part and a
"negative" part, A^+ and A^-, respectively, as in (15),

    A = A^+ + A^-    (15)

where the operator "+" is the normal arithmetic operation.
For example, the signed digit number \bar{1}10\bar{1} is the
sum of

    \bar{1}10\bar{1} = (2^2) + (-2^3 - 2^0)
                     = 4 - 9 = -5.    (16)

This property makes digital-to-analog conversion trivial.
The circuit of Fig. 8 displays the essential components [17].
Separating A as above then permits us to simply add
the parts together in a conventional adder whose result is
represented in the number system compatible with the
interconnected DAC as in Fig. 9.
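
The split in (15) amounts to routing the positive digits to one ordinary binary word and the negative digits to another. A minimal sketch follows, with hypothetical function names and most-significant-digit-first lists; it models the arithmetic only, not the circuits of Figs. 8 and 9.

```python
def split_parts(digits):
    """Split an MSB-first signed-digit word into two unsigned binary words.

    The first word marks the positions holding +1, the second the
    positions holding -1.
    """
    positive = [1 if d > 0 else 0 for d in digits]
    negative = [1 if d < 0 else 0 for d in digits]
    return positive, negative


def binary_value(bits):
    """Value of an unsigned MSB-first binary word."""
    value = 0
    for b in bits:
        value = 2 * value + b
    return value


def sbnr_to_int(digits):
    """Recombine the two parts: positive magnitude minus negative magnitude."""
    positive, negative = split_parts(digits)
    return binary_value(positive) - binary_value(negative)


if __name__ == "__main__":
    word = [-1, 1, 0, -1]
    print(split_parts(word), sbnr_to_int(word))
```

Each part is a conventional binary word, so two ordinary DAC's (or one DAC fed by a conventional subtractor) are enough in this model.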


The ADC realization is greatly simplified by noting that
our TRIT representations require no positive number recoding.
Negative numbers need only change the representation
of the leading "1" to "\bar{1}" [9]. Two's complement
binary numbers carry straight across to SBNR except for
the leading digit and only if the number is negative. As a
result, any ADC can be directly used which generates
binary numbers (biased, offset, one's or two's complement,
sign-magnitude). It is noteworthy that these
ADC/DAC efficiencies do not carry over for SDNR numbers.
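
For the two's complement case, the recoding described above reduces to negating the leading digit, because that digit already carries a negative weight. The sketch below checks this exhaustively for short words; the function names are hypothetical and other ADC codes (offset, sign-magnitude) are not modeled.

```python
from itertools import product


def twos_complement_to_sbnr(bits):
    """Recode MSB-first two's complement bits as SBNR digits.

    Only the leading digit changes, and only when it is 1.
    """
    digits = list(bits)
    if digits:
        digits[0] = -digits[0]
    return digits


def twos_complement_value(bits):
    """Integer value of an MSB-first two's complement word."""
    value = 0
    for b in bits:
        value = 2 * value + b
    if bits and bits[0] == 1:
        value -= 2 ** len(bits)
    return value


def sbnr_value(digits):
    """Value of a signed-binary word, most significant digit first."""
    value = 0
    for d in digits:
        value = 2 * value + d
    return value


def check_all(width):
    """Confirm the recoding preserves the value of every `width`-bit word."""
    return all(
        sbnr_value(twos_complement_to_sbnr(bits)) == twos_complement_value(bits)
        for bits in product((0, 1), repeat=width)
    )
```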

VII.  Comparative  Performance 

In this section, the LMS systolic SBNR architecture is
compared to four other architectures. These cases are: 1)
conventional 2's complement binary full-parallel
adder/multipliers, 2) distributed arithmetic variation of (1)
using bit-wise adders across the filter taps, 3) redundant
arithmetic cells replacing the adders/multipliers of (1), and
4) bit-sequential arithmetic cells replacing the adders/multipliers
of (1).

The LMS algorithm can be implemented in any of these
architectures with either sparse or fully parallel/pipelined
hardware. When implemented with 2N multipliers and 2N
adders, as in Case 1 above, no faster implementation is
possible. However, for most applications, 2N multipliers
are overwhelmingly expensive in VLSI real estate.

Fig. 10. (a) A fully parallel conventional architecture. (b) Sips
bit-sequential architecture. (c) Redundant arithmetic cell architecture.


Comparable architectures are depicted in Fig. 10. Case 1
(Fig. 10(a)) utilizes the most hardware (2N multipliers and
2N adders) in the conventional fully parallel sense. Case 2
is essentially the Cowan architecture of Fig. 6. Case 3 (Fig.
10(c)) is a redundant arithmetic cell proposed by Chow [10]
where an SDNR implementation is assumed. Here, each
cell incorporates two signed-digit adders and one signed-digit
multiplier, where signed-digits obey the properties of
(1)-(4). Case 4 (Fig. 10(b)) is a bit-sequential cell approach
also replacing the adders/multipliers of Case 1. This
arrangement, proposed by Sips [18], makes use of redundant
arithmetic but not as efficiently as SBNR implementations
because higher radices require more wire interconnect space.

The Sips bit-sequential cell can be configured in a linear
two-dimensional array for +, x, and ÷ operations.
Fig. 10(b) depicts the individual full adder (FA) cell with a
D flip-flop for latching operands, a 3-input AND gate, and
a 2-input XOR gate. If east/west as well as north/south
paths are necessary, an additional flip-flop is required. The
XOR gate obtains the complemented operand (for negative
values). Control lines XTL and XTL' successively load new
operand bits into the adjacent column for systolic addition,
as shown in the lower portion of the figure.

The SDNR cell depicted in Fig. 10(c) has been proposed
by Chow for radix 16 members of the set
(-10, -9, ..., 0, 1, 2, ..., +9). The cell operations are
described in Appendix A. Similar to the Sips cell, it uses
redundant numbers, but in a two-level adder scheme. The
second adder converts signed-digit numbers to conventional
binary. Irwin and Owens [19] show that a systolic
array of such cells can perform digit addition/subtraction
in four gate delays, multiplication in six gate delays and
shifting in zero gate delays. This systolic array has one
severe drawback. Owens [20] shows that the redundant
number set must be symmetric (i.e., |r_1| = |r_2| in (3)) and
multiplication operand digits must be fractions. As yet, no
rapid integer, nonsymmetric multiplication algorithms exist
for SDNR.

Gate costs listed in Table V for MOS realizations of
each primitive logic element are used to derive the relative
area/time complexities of typical CAD library cells needed
for each architecture contrasted. We assume that the full
adder (FA) circuits require 18 MOSFETs, 4 cells, 3 levels,
and 11 intraconnections. Latches are fabricated from D-type
flip-flops, each requiring 16 MOSFETs, 8 cells, 7
levels, and 9 intraconnections [21, p. 207]. Dynamic shift
registers require 8 MOSFETs, 4 cells, 2 levels, and 9
intraconnections per bit [21, p. 222]. Static MOS RAM
cells each use 6 MOSFETs, 4 cells, 1 level, and 10
intraconnections [21, p. 249]. An N-input NAND (N < 4)
gate requires N + 1 MOSFETs, N + 1 cells, 1 level, and
N + 2 intraconnections [21, p. 144].
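
As an illustration of how these per-element costs combine, the following sketch tallies the MOSFET count of one Sips bit-sequential cell. The composition (one full adder, one D flip-flop, one 3-input AND, one 2-input XOR) follows the description of Fig. 10(b); modeling the AND gate as a NAND followed by an inverter is an assumption made here, not something the text specifies, and the function names are hypothetical.

```python
# Per-element costs transcribed from Table V, as tuples of
# (MOSFETs, cells, levels, intraconnections).
def nand(n):
    """N-input NAND gate cost."""
    return (n + 1, n + 1, 1, n + 2)

def xor(n):
    """N-input XOR gate cost."""
    return (3 * n + 3, 3 * n + 3, 3, 3 * n + 6)

INVERTER = (2, 1, 1, 2)
FULL_ADDER = (18, 4, 3, 11)
D_FLIP_FLOP = (16, 8, 7, 9)

def total_mosfets(parts):
    """Sum the MOSFET counts (first tuple field) of a list of elements."""
    return sum(part[0] for part in parts)

# Assumed composition of one Sips cell; the AND gate is modeled as
# NAND + inverter, which is an assumption of this sketch.
sips_cell = [FULL_ADDER, D_FLIP_FLOP, nand(3), INVERTER, xor(2)]

print(total_mosfets(sips_cell))  # 18 + 16 + 4 + 2 + 9
```

Adding a second flip-flop for east/west paths would raise the tally by another 16 MOSFETs under the same assumptions.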

Any VLSI chip is composed of interconnection area,
effective chip area occupied by library cells, and an overhead
area. Assuming then that a silicon compiler is used,
the area and time complexities for common library cells of
Kronlof [22] are relevant here. Table VI in conjunction
with Table V relates the wordlength K to the L successive

TABLE V
MOS Realizations of Boolean Functions

Function                      # MOSFETs   Cells     Levels   Intraconnections
---------------------------   ---------   -------   ------   ----------------
Inverter                      2           1         1        2
NAND                          N + 1       N + 1     1        N + 2
Buffered NAND                 N + 5       N + 5     3        N + 5
NOR                           N + 1       N + 1     1        N + 2
XOR                           3N + 3      3N + 3    3        3N + 6
2-Bit Half Adder              15          4         3        9
2-Bit Full Adder              18          4         3        11
1-Bit RAM (Static)            6           4         1        10
1-Bit ROM                     2           1         1        5
1-Bit Shift Register          8           4         2        9
D Flip-Flop                   16          8         7        9
D Flip-Flop (Master Slave)    32          16        14       20
S-R Latch                     6           4         2        6

TABLE VI
Library Cell Relative Area/Time Complexities

Component Type                   Area Compl.     Time Compl.
------------------------------   -------------   -----------
Parallel Multiplier (B x B)      B^2             B
Accumulator (K = LB + 2)
  - adder (Brent-Kung)           K log K + 1     log K + 1
  - shifter, e.g.                K               1
Adders (Brent-Kung)              B log B + 1     log B + 1
Coefficient Memory               BL              1
Pipeline Register                BL              1
Register, Ports, e.g.            B               1
LB out of 2B switch (MUX's)      LB(L + 2B)      1
Iteration Control (Counter)      L log L         1
Queue Elements                   BL              1
Systolic Cells
  Chow (SDNR)                    4               2
  Sips                           1               1
  SBNR                           1               1

TABLE VII
Comparison of Architectural Complexity: Systolic Array

                   Conventional Binary         Distributed Arithmetic    Bit-Sequential Cells
                   (2N multipliers,            (Cowan)                   (Sips)
                   2N adders)
----------------   -------------------------   -----------------------   ------------------------
Gate Complexity    O(2mN)                      O(kN)                     O(km)
Latency            N + 1 memory writes         kN bit shifts             i + one ADC bit
                                                                         conversion
VLSI Amenable      structure irregular         moderate                  yes
Estimated Pin      not appropriate             not appropriate           40*
Count/Cell
Area Complexity    > 2N(B^2 + K log K + 1      2K log K + 1 + 2K         K log K + 1
                   + K + BL)                   + 2BL + 2B                + B log B + 1 + 2B
Time Complexity    > max(B or log K + 1)       log B + 1                 log B + 1 + 2 log K + 1

                   Redundant Arith. Cells      Redundant Arith. Cells
                   (Chow)                      (SBNR)
----------------   -------------------------   -----------------------
Gate Complexity    O(km)                       O(kN)
Latency            one ADC digit conversion    one ADC bit conversion
VLSI Amenable      yes                         yes
Estimated Pin      10m                         10
Count/Cell
Area Complexity    2(B/m)^3                    K log K + 1
                   + 4(B log B + 1)            + B log B +
Time Complexity    mB                          B

* Assumes each cell is a 4-bit slice.
m = number of digits in a word.
k = number of bits to each shift register (k < m).
N = number of filter coefficients.
i = small positive constant (3 or 4) less than the time to complete a full bit-parallel operation.
bit fields of B where

    B = \lceil K/L \rceil bits.    (17)

For two's complement numbers in binary fixed-point representation,
the real value, X, can be represented by an
unsigned binary integer value, X_b, and the MSB, X_0, as
in

    X = X_b / 2^{LB-1} - 2 X_0.    (18)
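
A short numerical check of (18) may help. The sketch below reads (18) as X = X_b / 2^(LB-1) - 2 X_0, which is an interpretation of the printed formula, and compares it against the usual definition of a two's complement fraction (sign bit of weight -1 followed by fractional bits). The function names are hypothetical.

```python
def value_from_eq18(bits):
    """Real value of a two's complement fraction (MSB first) via Eq. (18)."""
    width = len(bits)                           # plays the role of LB
    x_b = int("".join(str(b) for b in bits), 2)  # unsigned integer X_b
    x_0 = bits[0]                               # most significant bit X_0
    return x_b / 2 ** (width - 1) - 2 * x_0

def value_from_definition(bits):
    """Reference: -bits[0] + sum of bits[i] * 2**-i for i >= 1."""
    return -bits[0] + sum(b * 2.0 ** -i for i, b in enumerate(bits) if i >= 1)

for word in ([1, 0, 1, 1], [0, 1, 1, 0], [1, 0, 0, 0]):
    print(word, value_from_eq18(word), value_from_definition(word))
```

For the word 1011 both routes give -0.625: X_b = 11, so 11/8 - 2 = -0.625.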

Hence, a wordlength K has LB bits. The area penalty of
wires is proportional to B and to the square root of the
effective chip area. Power distribution lines and bonding
pads constitute the overhead area. An SBNR cell is assumed
to have unity area and time complexity because
each cell is basically a "one-bit" device. The Chow cell
essentially has an area complexity four times the SBNR
because its SDNR realization basically assumes 4 bits per
digit on a radix 16 representation. The time complexity is
double because another level of logic depth is required.

Using Table VI, the area/time complexities of each of
the five architectures can be compared. Table VII also lists
the gate complexity, latency, VLSI suitability, and pin
count/cell.

VIII.  Conclusions 

The conventional binary architecture is hardware intensive
yet is ultimately the fastest. The distributed arithmetic
is a compromise between speed and silicon space. However,
a regular design for VLSI is not easily achieved since
no repetitive cell is utilized as in the three systolic
implementations. Of these, the SBNR systolic implementation is
highly regular, possessing very short signalling wires.
Furthermore, local control in this self-timed synchronous system
eliminates the need for global control lines which
degrade performance of synchronous systems as in the
conventional binary architecture.

A number system entirely composed of signed bits
(-1, 0, 1) amenable to ternary valued circuits has been
proposed for signal processing units where add/multiply
cycles dominate. Such SBNR implementations can be configured
as a systolic array to perform n x n matrix operations.
Because the carry/borrow distance is minimal for
SBNR, intercell communication is reduced. As a result,
extensive carry-propagation, lookahead hardware is no
longer required and mathematical operations are no longer
dependent on wordlength as in conventional two's complement
binary systems. Thus synchrony so vital to systolic
arrays is more easily achieved and true data-flow SIMD
machines result.

Although the area and time complexities of the three
systolic arrays are comparable, the latency (time interval
from input signal to output signal) is smallest for the
SBNR array. Furthermore, the SBNR offers a successful
fault tolerant implementation [4], [10]. The estimated pin
count/cell is smallest for the SBNR array. This is because
the left-to-right (MSB-to-LSB) computing property of
SBNR numbers allows us to begin computations upon
receipt of the MSB from the ADC. In view of these
properties, we conclude that the SBNR systolic array is a
competitive if not superior alternative to other implementations.
We anticipate future signal processing architectures
will take advantage of SBNR.

Appendix  A 

SDNR  Chow  Cell  Operations 

The SDNR cell of Fig. 10(c) is capable of addition,
subtraction, multiplication, and assimilation (which assists
data conversion). All operands are assumed to be normalized
floating-point numbers of the form

m 

Z  V"‘  ( A- 1 ) 

i-0 

where E is an integer with an e-bit two's complement
representation. In the following operations, t' and w' are
transfer digits which perform the same intermediate
carry/borrow functions as our SBNR Zdl, Z„, and Crfl,
C„ bits except that t' and w' digits each require multiple
lines (e.g., a radix 16 SDNR digit can be represented with 5
bits, one for sign and four for magnitude).

Definitions 

H'mm»  ~  1)/2|- 

if  p-Kr-l)/2|  is  even, 

“Kr  - 1)/2|+ 1,  otherwise. 

-Kp2  - 

■*ma»  ”KP  4-  um«*  wmui  ~  ^ )/  f I* 

Dh  contains  digits  from  0  to  r  - 1. 

D(,_ ,,  +  contains  digits  from  0  to  r  -  1. 

Dp  U  +  contains  digits  from  p  to  r  -  1. 

Input  Digits 

The  input  digits  a ,  b,  and  c  belong  to  the  digit  sets 
D„  U  D[r_ ,,  + ,  D„,  and  Dp,  respectively. 

The  transfer  digits  /'  and  w‘  belong  to  the  digit  sets 
and  respectively. 

The  borrow  /}'  is  either  0  or  1. 

Functions  of  the  Three  Levels 

[MJ  If  Af-MULT  - 1 

then  rv  +  w  —  be  with  u  in  £>,„  ,  and  w  in 

4-^1. 

else  u  —  b  and  w  «-  0. 

(51)  If  Sl-ADD - 1 

then  rx  +  t  —  a  +  w'  +  u, 
else  rx  + 1  *-  a  —  w'  —  u. 

In  either  case,  x  is  in  Z>u  )  and  r  is  in  Dlf  , 

[52]  If  ASSIM-1 

then  -  rfi  +  s  «-  a  -  {}'  with  s  in  D[r_  ,,  +  and  /3 
is  0  or  1, 
else  s  *- 1'  +  x. 
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ABSTRACT 

A  comparative  study  of  various  number  systems  1- 
the relative merits for real-time signal processing
signed digit, redundant number, and a new repr
binary number representation (SBNR) are contrasted
complexity. It is shown that the SBNR system h
complexity  and  regularity  attributes  amenable  to  V. 
proposed  for  minimal  intercell  connectivities,  a  prere 
array implementations. A test case realizing the least-squares algorithm
in a systolic array for adaptive beamforming applications indicates
comparative differences. Executing a square-root free Givens rotation
matrix operation iteratively in a two systolic array configuration
demonstrates real time signal processing for beamforming.
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INTRODUCTION 

A comparative architecture study was performed in order to implement
the scaled Givens rotation solution to the least-squares minimization
problem. Three architectures are examined: a) Conventional Systolic
Array, b) Distributed Arithmetic Array, and c) SBNR Systolic Array.
Important considerations in adaptive beamforming algorithm to architecture
mapping include gate count estimates for some of the architectures and
tables for performing these estimates. A systolic architecture for an
adaptive beamformer tracking system is developed for performing recursive
least-squares minimization.

The purpose of this project is to identify engineering trade-offs and
interconnection strategies capable of achieving real-time implementation
of signal processing algorithms via limited user-programmable mechanisms
(e.g., firmware). Flexible firmware-oriented architectures dedicated to
signal processing can then be identified. The specific test algorithm
performs an orthogonal triangularization of the data matrix using a
pipelined sequence of Givens rotations and generates the required residual
without having to solve the associated triangular linear system by
back-substitution.


Array  Architectures 

Systolic array architectures remain diverse. At the extreme ends are
the WARP array and the GAPP array. WARP utilizes 68000 microprocessors in
each processing element (PE). GAPP uses a 1-bit ALU with 128-bit RAM as
each PE. Although more primitive (the 68000 is a 16-bit parallel engine),
GAPP is a single chip of 72 PE's. Because of its high speed and
availability, GAPP is viable. Between these outlying architectures lie
conventional and distributed arithmetic processing cell compositions.


This work was sponsored in part by Army Research Office Contract #DAAG2'j-
8}-C-002^. The views, opinions, and findings contained in this report are
those  of  the  author  and  should  not  be  construed  as  an  official  Department 
of  the  Army  position,  policy  or  decision,  unless  so  designated  by  other 
documentation. 
A study of basic PE's was made. A new PE utilizing a primitive cell
is proposed for a systolic array PE. Another alternative PE based on a
distributed arithmetic cell was studied. This cell increases computation
speed by reducing multiplication to table look-up of partial products and a
series of shift/add operations.

Beamforming Architectures

An antenna beam is a collection of point sources or receptors where
geometry governs the characteristic equations of the system. For a
uniformly spaced line array as depicted in Figure 1, the following form
applies:

    G(\alpha) = \sum_{n=0}^{N-1} g_n e^{-j 2\pi (n d \cos(\alpha) / \lambda)}    (1)


Equation 1 has the basic form of a DFT. When we consider that
\cos\alpha = (k/N)(\lambda/d), the beamformer output at angles \alpha is
computable by the DFT as follows.

    G_k = \sum_{n=0}^{N-1} g_n e^{-j 2\pi n k / N}    (2)

From  this  we  can  easily  see  that  a  2-D  temporal-spatial  Fourier  transform 
can  form  beams  in  the  nonuniformly  spaced  look  directions.  Adaptive 
beamforming  must  then  cause  the  beam  pattern  to  favor  certain  spatial  or 
spectral  parameters. 
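
The DFT form of the beamformer output can be evaluated directly. The sketch below is a minimal illustration, assuming Equation 2 reads G_k = sum over n of g_n exp(-j 2 pi n k / N); the test signal (a plane wave whose phase advances by a fixed amount per sensor) and all names are hypothetical choices made for the example.

```python
import cmath
import math

def beam_outputs(g):
    """Evaluate G_k = sum_n g[n] * exp(-j*2*pi*n*k/N) for k = 0..N-1."""
    n_sensors = len(g)
    return [
        sum(g[n] * cmath.exp(-2j * math.pi * n * k / n_sensors)
            for n in range(n_sensors))
        for k in range(n_sensors)
    ]

# Hypothetical sensor snapshot: phase advances by 2*pi*3/N per sensor,
# so all of the energy should appear in beam (bin) 3.
N = 8
arrival_bin = 3
samples = [cmath.exp(2j * math.pi * arrival_bin * n / N) for n in range(N)]

magnitudes = [abs(G) for G in beam_outputs(samples)]
strongest = max(range(N), key=lambda k: magnitudes[k])
print(strongest, round(magnitudes[strongest], 6))
```

A direct sum like this costs O(N^2) operations per snapshot; it is shown only to make the beam/bin correspondence concrete.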
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Figure  1  Line  Array  Beamforming 

Adaptive  Beamformer  and  Tracker  System 

Feintuch et al. [1] investigated an adaptive tracking system which
employed the LMS algorithm to minimize the error between two beams of a
split array. The weights generated are analyzed to determine the maximum
weight. It roughly corresponds to the delay between the phase centers of
the two beams. The phase or time delay is then used to provide a bearing
estimate (for adaptive nulling, etc.).

We consider a least-squares implementation of the adaptive tracker to
construct a completely systolic adaptive beamformer/tracker from the
systolic Givens rotation, DFT, and backsubstitution architectures.
Feintuch provides a suitable starting point for incorporating
adaptability. Simply stated, we use the peaks between weights to
electronically steer the beam to force nulls at jammer angles. A
time-domain least-squares adaptive tracking system can be configured as shown
in Figure 2. Two inputs are required for the adaptive tracker. Multiple
time "snapshots" of these samples are collected to form the vector y(n) and
the matrix X(n), from which the tracker computes the weight vector W(n)
that minimizes the least-squares norm:

    E(n) = \| e(n) \| = \| X(n) W(n) + y(n) \|    (3)

The  largest  tap  of  the  weight  vector  is  then  found  and  provides  the  phase 
bearing  estimate. 


Figure  2  Least-Squares  Time  Domain  Adaptive  Tracker 


In the frequency domain solution, shown in Figure 2, time-domain inputs
undergo a Fourier transform. Multiple time "snapshots" of one half the
array are taken to produce a matrix of frequency components for the LS
algorithm. The largest tap over the frequency weights is selected and the
phase provides a bearing estimate which is used to steer the beam.

Systolic Adaptive Beamformer and Tracking System

Figure 3 shows a complete frequency-domain adaptive beamformer and
tracking system. The computationally intensive components of the system are
systolic array modules.

Figure 3 Frequency Domain Adaptive Beamformer and Tracker

The K-point DFT modules of the system perform a Fourier transform of
the time domain input data. A system consideration at this point is the
Fourier transform throughput. The phase shift multiply is driven by the
phase estimate from the adaptive tracker. Each frequency component from
the Fourier transform is multiplied by the term


where T is a function of the steering angle. A multiplier array operates
at this function block. A conventional distributed arithmetic engine or
SBNR distributed arithmetic engine may be ideal for this array, since all
other quantities are only position dependent and the steering angle is the
only variable, non-position-dependent quantity in the computation of T for
a fixed sensor array. Hence, an ultra fast table look-up of the phase term
can occur based solely on the steering angle.
An adder array is used to form the frequency bins of the beam. Since
the beam is already in the frequency domain after the summing operation,
the bins can be fed directly to the LS algorithm. The LS block consists
of a Givens rotation systolic array and a backsubstitution array to
compute the weights. The peak of the weight vector can be found using a
quadratic interpolator. The interpolator performs a quadratic fit to the
largest weight element and the two adjacent weights in the frequency
domain.
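
The quadratic fit can be sketched as a three-point parabola through the largest weight and its two neighbors. This is the textbook parabolic peak interpolation, offered as one plausible reading of the interpolator; the text does not give its exact formula. It assumes real-valued weight magnitudes and a peak that is not at either end of the vector.

```python
def quadratic_peak(weights):
    """Fit a parabola through the largest interior element and its two
    neighbors; return (fractional peak index, interpolated peak value)."""
    k = max(range(1, len(weights) - 1), key=lambda i: weights[i])
    left, mid, right = weights[k - 1], weights[k], weights[k + 1]
    denom = left - 2.0 * mid + right      # proportional to the curvature
    if denom == 0.0:                      # flat triple: nothing to refine
        return float(k), mid
    offset = 0.5 * (left - right) / denom  # vertex position relative to k
    peak = mid - 0.25 * (left - right) * offset
    return k + offset, peak

# Hypothetical weight magnitudes sampled from a parabola peaking at 2.3.
weights = [10.0 - (i - 2.3) ** 2 for i in range(6)]
index, value = quadratic_peak(weights)
print(round(index, 6), round(value, 6))
```

Because the sample data are exactly parabolic, the fit recovers the true peak location and height; for real weight vectors the result is an approximation.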

Systolic Array Least Squares Solution

McWhirter proposes a set of 3 primitive cells arranged in a
triangular systolic array which performs recursive least-squares
minimization. Orthogonal triangularization of the data matrix is
performed using a pipelined sequence of square-root free Givens rotations.
The square-root free Givens rotation triangular systolic array is shown in
Figure 4. The associated primitive cells are given in Figure 5. To
provide a common performance testbed, the conventional binary, SBNR, and
distributed arithmetic architectures were studied based on an
implementation of this systolic array.

Barlow and Ipsen developed a scaled Givens rotation systolic
algorithm. The scaled Givens rotation algorithm operates on banded
matrices of width w = p + q + 1 where p is the number of superdiagonals
and q is the number of subdiagonals. Assuming m rows in the banded
matrix and z right-hand-side vectors, the number of computation steps is
given by:

    2m + 3(q + 1) + z + 1    (4)

The individual cell complexity (the number of equations solved at each
cell) is approximately the same as that for the square-root free Givens
rotation. Only one division operation is required in the scaled Givens
rotation and many of the multiplies are reduced to shift operations. Both
scaled Givens rotations and square-root free Givens rotations have
processor utilizations of approximately 50%.
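
To make the triangularization concrete, the sketch below solves a small least-squares problem by Givens rotations in plain Python. Two caveats apply: it uses the standard rotation (with a square root), not the square-root free or scaled variants discussed above, and it recovers the weights by back-substitution, whereas the systolic array described here produces the residual without that step. It minimizes ||Xw - y||; pass -y to match a norm of the form ||XW + y||. All names are hypothetical.

```python
import math

def givens_least_squares(X, y):
    """Triangularize [X | y] with Givens rotations, then back-substitute.
    Returns (weights, residual norm). Assumes X has full column rank."""
    rows, cols = len(X), len(X[0])
    A = [list(X[i]) + [y[i]] for i in range(rows)]    # augmented matrix
    for j in range(cols):
        for i in range(j + 1, rows):
            a, b = A[j][j], A[i][j]
            norm = math.hypot(a, b)
            if norm == 0.0:
                continue                               # nothing to annihilate
            c, s = a / norm, b / norm
            for k in range(j, cols + 1):               # rotate rows j and i
                upper, lower = A[j][k], A[i][k]
                A[j][k] = c * upper + s * lower
                A[i][k] = -s * upper + c * lower
    w = [0.0] * cols
    for j in reversed(range(cols)):                    # back-substitution
        acc = A[j][cols] - sum(A[j][k] * w[k] for k in range(j + 1, cols))
        w[j] = acc / A[j][j]
    residual = math.sqrt(sum(A[i][cols] ** 2 for i in range(cols, rows)))
    return w, residual

# Fit y = 1 + 2t at t = 0..3 (an exact fit, so the residual is zero).
weights, residual = givens_least_squares(
    [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]], [1.0, 3.0, 5.0, 7.0])
print([round(v, 6) for v in weights], round(residual, 6))
```

Each inner rotation touches only two rows, which is the locality that lets the computation map onto a triangular array of simple cells.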

Conventional  Binary  Implementation 

The GAPP is a commercially available systolic array device
providing 72 conventional binary processing elements, dimensioned as a 6 x
12 rectilinear array. The square-root free or scaled Givens rotation is
easily implemented on this device. In order to obtain realistic speed
estimates for a conventional binary implementation of the square-root free
Givens rotation, code for the GAPP device has been written in a C-like
language known as GAL (TM).

The GAL square-root free Givens rotation code requires approximately

    (2r + c + 1)(83n^2 + 224n + 156)    (5)

instruction cycles where n is the bit length of the input operands, r is
the number of matrix rows, and c is the number of matrix columns. The
latency from first input to first residual output is

    (c + r + 1)(83n^2 + 224n + 156)    (6)
Figure 4 Systolic Array for Recursive Least-Squares Minimization
Figure 5 Cells Required for the Square-Root Free Givens Rotation


The time to complete the entire matrix reduction increases linearly
with the size of the array. The number of elements processed, however,
increases as the square of the array size. To estimate the processing
power of the GAPP solution, compute the number of array elements processed
per instruction cycle. Assuming a fixed word length, the word-length
dependent term is a constant k = 83n^2 + 224n + 156. The number of array
elements processed per cycle is:

    rc / ((2r + c + 1)k) elements/cycle    (7)

It can be seen that for square matrices (r = c) the number of elements
processed per cycle, r^2 / ((3r + 1)k), increases approximately linearly as
the array size increases. Improvements in speed may be obtained if
concurrency can be achieved in the operation of the three cell types of
McWhirter's algorithm. A promising solution is to use three separate arrays
(one for each cell type), and carefully synchronize data flow between the
arrays.
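
The GAPP estimates can be tabulated with a few lines of code. The sketch assumes the word-length term reads 83n^2 + 224n + 156, the form that recurs across Equations 5 through 7 and Table 2; the function names are hypothetical.

```python
def gapp_word_term(n):
    """Word-length dependent factor k = 83 n^2 + 224 n + 156 (cycles)."""
    return 83 * n * n + 224 * n + 156

def gapp_execution_cycles(r, c, n):
    """Eq. (5): instruction cycles for an r x c matrix of n-bit operands."""
    return (2 * r + c + 1) * gapp_word_term(n)

def gapp_latency_cycles(r, c, n):
    """Eq. (6): cycles from first input to first residual output."""
    return (c + r + 1) * gapp_word_term(n)

def elements_per_cycle(r, c, n):
    """Eq. (7): matrix elements processed per instruction cycle."""
    return r * c / gapp_execution_cycles(r, c, n)

# Throughput for square matrices of 8-bit operands.
for size in (4, 8, 16, 32):
    print(size, gapp_execution_cycles(size, size, 8),
          f"{elements_per_cycle(size, size, 8):.2e}")
```

Doubling the side of a square matrix roughly doubles the elements processed per cycle, since r^2 / (3r + 1) grows about as r/3.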

SBNR  Implementation 

A mesh connected systolic array of SBNR cells is used to implement
McWhirter's algorithm. A single SBNR cell is shown in Figure 6. It
consists of an appropriate set of registers which act as input to an
intermediate SBNR ALU and a final SBNR ALU.

It is possible to derive the minimum execution speed and latency for
this particular SBNR implementation of McWhirter's systolic array by
considering the data dependencies in the equations. The maximum data
dependency path lengths for the boundary, internal, and final cells are 5,
2, and 1, respectively. Using the 2N + 1 formula for latency, we can
compute latencies and speed estimates for each cell.

The maximum boundary cell latency is 11 cycles. At the internal
cell, the maximum latency is 5 cycles. At the final cell, the maximum
latency is 3 cycles. If r is the number of rows and c is the number of
columns, then the maximum latency to the first residual is:

    L(c, r) = 11(c + r + 1)    (8)

The execution time for the entire Givens rotation is:

    S(c, r, n) = (2r + c + 1)(11 + n - 1)    (9)

where n is the bit length of the operands. Notice that the execution time
is linearly dependent (O(n)) on the bit length of the operands where the
GAPP array execution time is quadratically dependent (O(n^2)). This is a
result of the reduction of multiplication complexity in SBNR arithmetic.
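
The linear versus quadratic dependence on word length can be seen numerically. In the sketch below the SBNR execution time is taken as (2r + c + 1)(11 + n - 1), which is an assumed reading of Equation 9 (the printed formula is partly illegible), and the GAPP estimate uses (2r + c + 1)(83n^2 + 224n + 156). Function names are hypothetical.

```python
def cell_latency(path_length):
    """Cell latency in cycles from the 2N + 1 rule."""
    return 2 * path_length + 1

def sbnr_latency(r, c):
    """Eq. (8): latency to the first residual, in cycles."""
    return 11 * (c + r + 1)

def sbnr_execution(r, c, n):
    """Eq. (9) as read here (an assumption): (2r + c + 1)(11 + n - 1)."""
    return (2 * r + c + 1) * (11 + n - 1)

def gapp_execution(r, c, n):
    """GAPP estimate: (2r + c + 1)(83 n^2 + 224 n + 156)."""
    return (2 * r + c + 1) * (83 * n * n + 224 * n + 156)

for n in (8, 16, 32):
    ratio = gapp_execution(8, 8, n) / sbnr_execution(8, 8, n)
    print(f"n={n:2d}  GAPP/SBNR cycle ratio = {ratio:.0f}")
```

Whatever the exact constant in Equation 9, the ratio of the two estimates grows with n because one is quadratic and the other linear in the word length.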

Distributed  Arithmetic  Implementation 

The goal in distributed arithmetic architectures is to reduce
computation time by performing table look-up to produce partial
computation results. For example, multiplication can be reduced to a
table look-up of partial products followed by a series of shifts and adds
to obtain the final result. In an N bit by N bit multiply, it is possible
to divide each operand into k segments. By combining each segment of one
operand with every segment of the other operand, an address in RAM
of each partial product is formed. The partial products are looked up
and, through a series of shifts and adds, accumulated to form the final
product. Table 1 shows a comparison of a typical N bit multiply using
conventional binary versus distributed arithmetic.
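
The segmented table look-up multiply described above can be sketched as follows. This is an illustrative software model, not the hardware cell: it assumes unsigned operands, a word length divisible by k, and a single table indexed by a pair of segments. All names are hypothetical.

```python
def build_table(seg_bits):
    """Partial-product table addressed by a pair of seg_bits-wide segments."""
    size = 1 << seg_bits
    return [[a * b for b in range(size)] for a in range(size)]

def split(value, seg_bits, k):
    """Split value into k segments of seg_bits bits, least significant first."""
    mask = (1 << seg_bits) - 1
    return [(value >> (i * seg_bits)) & mask for i in range(k)]

def table_multiply(x, y, n_bits, k, table):
    """Multiply two unsigned n_bits-wide operands using k segments each.
    Returns (product, number of table look-ups)."""
    seg_bits = n_bits // k
    xs, ys = split(x, seg_bits, k), split(y, seg_bits, k)
    product, lookups = 0, 0
    for i, x_seg in enumerate(xs):
        for j, y_seg in enumerate(ys):
            # Look up the partial product, shift it into place, accumulate.
            product += table[x_seg][y_seg] << ((i + j) * seg_bits)
            lookups += 1
    return product, lookups

table = build_table(4)                       # 2^(2*4) = 256 entries
result, lookups = table_multiply(0xB7, 0x5C, 8, 2, table)
print(result, 0xB7 * 0x5C, lookups)
```

With k = 2 segments the multiply needs k^2 = 4 look-ups; halving the segment width would quadruple the look-ups while shrinking the table, which is the trade-off between table size and computation speed.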
Figure 6 An SBNR Primitive Cell for a 2-D Mesh Connected Array


Table 1 N Bit By N Bit Multiply Comparison

Conventional Binary       Distributed Arithmetic
-----------------------   -----------------------
n shifts                  k^2 - 1 shifts
n adds                    k^2 - 1 adds
n^2 ANDs (bit-wise)       k^2 table look-ups


If RAM access speed is greater than the computation time for n^2 AND
operations, then distributed arithmetic provides better performance than
conventional  arithmetic  for  k  <  n.  There  is  a  trade-off  between  table 
size and computation speed. The table size for any distributed arithmetic
multiplier is 2^(2n/k) words. While computation time grows with k, the
table size shrinks as k grows (i.e., large k implies larger computation
time but smaller table size).


An  n  bit  distributed  arithmetic  computational  element  is  shown  in 
Figure  7.  This  computational  element  is  a  single  bit  ALU  with  the  special 
feature  that  a  table  address  register  can  be  loaded  and  a  partial  product 
retrieved  for  further  computation. 
Figure 7 Distributed Arithmetic Primitive Element Architecture

This cell can be incorporated in a mesh-connected systolic array to
perform the Givens rotation by McWhirter's algorithm. The cell is
assumed to operate like the bit-sequential cell of the GAPP, except
that the multiplication is no longer O(n^2) but O(n). Estimates of
latency and execution time can be made by removing the n^2 term from the
estimates made for GAPP. Thus the latency of the distributed
arithmetic implementation of McWhirter's algorithm is approximately

    (c + r + 1)(224n + 156)    (10)
and the execution time is approximately

    (2r + c + 1)(224n + 156)    (11)

COMPARATIVE  ANALYSIS 

A  comparison  of  systolic  primitive  elements  is  presented  in  Table  2. 
Five  architectures  are  examined. 

a.  Conventional  binary  bit  sequential  cell  (GAPP) 

b.  Conventional  binary  (complex  cell) 

c.  Distributed  arithmetic 

d.  Signed  binary  number  representation  (complex  cell) 

e.  Signed  binary  number  representation  (mesh-connected  PE's) 

The architectures were contrasted assuming a square-root free Givens
rotation implementation to obtain speed and latency estimates. The
conventional binary (complex cell) and SBNR (complex cell) are both
algorithm dedicated architectures. As a result, they have irregular
structures and are not VLSI amenable. The conventional binary (complex
cell) architecture has O(1) speed and latency. The SBNR (complex cell)
and the SBNR (mesh-connected PE), the latter of which is VLSI amenable,
have O(n) and O(1) speed and latency, respectively. The distributed
arithmetic architecture exhibits O(n) speed and latency.

The conventional binary (complex cell) is superior in terms of speed
and latency only. The distributed arithmetic and SBNR (mesh-connected PE)
architectures have excellent speed and latency and both are VLSI amenable.
Distributed arithmetic bandwidth is smaller than SBNR (mesh-connected PE)
bandwidth; however, SBNR (mesh-connected PE) has a superior latency. For
adaptive beamformer implementation, both architectures are viable.

Table 2 Comparison of Systolic Array Architectures

Complexity and        Conventional Binary          Conventional Binary    Distributed
Performance           Bit Seq. Cell (GAPP)         (1 Adder,              Arithmetic
                                                   3 Multipliers)
                                                   [note 1]
-------------------   --------------------------   --------------------   --------------------
Speed                 (2r+c+1)(83n^2+224n+156)     3(2r+c+1)              (2r+c+1)(224n+156)
Latency               (r+c+1)(83n^2+224n+156)      3(r+c+1)               (r+c+1)(224n+156)
Cell Complexity       Simple                       Complex                Simple
I/O Bandwidth         c                            cn                     c
VLSI Amenable         Yes                          structure irregular    Yes
Algorithm Dedicated   No                           Yes                    No
Gate Counts           -                            -                      -

Note 1. For Boundary Cell. Internal cell requires 2 Adders, 2 Multipliers.
Final cell requires 1 Multiplier.

Complexity and        SBNR                                SBNR
Performance           (Complex cell)                      (mesh-connected PE's)
-------------------   ---------------------------------   ------------------------
Speed                 (2r+c+1)(20+n)                      O(n)
Latency               20(r+c+1)                           O(1)
Cell Complexity       Complex (multiple SBNR PE's)        Simple
I/O Bandwidth         2c                                  2c
VLSI Amenable         structure irregular                 Yes
Algorithm Dedicated   Yes                                 No
Gate Count            rc(276b + 7 log_2 w + 42w + 64n)    rc(1b7 + log_2 w + 6w)

Table 2 Notation:

r = rows of rectilinear matrix
c = columns of rectilinear matrix
n = word length
VLSI-MVL IMPLEMENTATION OF A FAST ARITHMETIC CELL WITH SBNR
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ABSTRACT 

To  meet  the  demands  of  high-speed  si 
the  size  constraints  of  VLSI  implementati 
processing  element  is  described.  This  elemen 
Number  Representation  (SBNR)  to  achieve  fully 
The  result  is  an  adder,  suitable  for  use  in  V 
whose throughput is independent of word length. Use of SBNR also
reduces  intrachip  connection  area,  thus  allowing  a  higher  device 
density  to  give  each  chip  correspondingly  greater  processing 
power. 
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INTRODUCTION 

While the demand for faster, more powerful signal processors
has increased, the space allotted them has decreased. A need
exists, therefore, for very fast arithmetic circuits that can be
very densely integrated. This requires that the circuits
designed consume little power and require little area for devices
and interconnections. In addition, a system should be found that
allows fast arithmetic processing with little added chip area.

Conventional  adders  add  two  numbers  from  lowest  digit-place 
to  highest,  propagating  carry  down  the  length  of  the  number. 
Thus,  the  length  of  time  for  the  addition  depends  on  the  word 
length.  This  paper  describes  an  adder  that  meets  the 
requirements  for  VLSI  implementation.  The  adder  uses  signed 
binary  number  representation  (SBNR)  to  achieve  totally  parallel, 
carry-free  addition.  The  logic  circuits  used  were  developed  to 
allow  VLSI  implementation  of  multiple  valued  logic.  The  adder  is 
compared  to  conventional  designs  to  show  a  speed  increase  for 
various  operand  lengths. 
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ADDITION  IN  SIGNED  BINARY  NUMBER  REPRESENTATION 

A  redundant  number  representation  is  one  in  which  several 
different  digit  combinations  can  represent  the  same  number.  The 
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signed  digit  number  representation  (SDNR)  uses  positive  and 
negative  digits  to  represent  a  number.  This  special  coding  allows 
faster  arithmetic,  because  the  carry  propagation  distance  is 
limited  to  one  digit  position.  Standard  number  systems  require 
that  the  carry  into  a  digit  be  known  before  the  carry  out  can  be 
generated.  SDNR  does  not  require  such  foreknowledge,  so 
arithmetic  is  performed  in  parallel.  A  signed  digit  number  set 
of  radix  r  would  consist  of  the  digits 

$$\{k-1,\ k-2,\ \ldots,\ 1,\ 0,\ -1,\ \ldots,\ -(k-2),\ -(k-1)\}, \quad \text{where } k \le r.$$

This  digit  set  represents  a  redundant  number 

$$X = x_n r^{n-1} + x_{n-1} r^{n-2} + \cdots + x_2 r + x_1.$$

When this type of number system is used in an arithmetic unit,
propagation  of  carry  is  limited  to  two  digit  places.  Since  the 
carry  does  not  propagate  down  the  entire  length  of  the  number,  an 
addition  is  done  in  constant  time,  independent  of  the  length  of 
the  operands,  n. 

The radix-two subset of SDNR is called signed binary number
representation--SBNR.  2  (It  is  also  referred  to  as  redundant 
binary  representation.)  The  digits  of  SBNR  belong  to  the  set 

$$\{1,\ 0,\ \bar{1}\}$$

where $\bar{1}$ represents -1. For example, the binary number
$1011_2 = 1(2^3) + 0(2^2) + 1(2^1) + 1(2^0) = 11_{10}$ can be
represented in SBNR as

$$1011 = 11\bar{1}1 = 110\bar{1}.$$
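The redundancy described above can be checked numerically. The following is a minimal sketch, not part of the original paper: the function name `sbnr_value` and the convention of writing the negative digit as the integer -1 are assumptions made for illustration.

```python
def sbnr_value(digits):
    """Return the integer encoded by SBNR digits, most significant first.

    Each digit must be 1, 0, or -1 (the negative digit of the text).
    """
    value = 0
    for d in digits:
        if d not in (-1, 0, 1):
            raise ValueError("SBNR digits must be -1, 0, or 1")
        value = 2 * value + d  # radix-two positional weighting
    return value


# Three different digit strings that all encode eleven.
encodings = [
    [1, 0, 1, 1],
    [1, 1, -1, 1],
    [1, 1, 0, -1],
]
for digits in encodings:
    print(digits, "->", sbnr_value(digits))
```

Because several digit strings map to the same value, an adder is free to pick whichever encoding avoids a long carry chain.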

Though  circuit  implementations  of  SBNR  are  not,  in  general, 
as  simple  as  those  of  binary,  they  are  much  simpler  than 
implementations  of  the  higher  radices.  Therefore,  SBNR  provides 
a  reasonable  compromise  between  the  parallel  arithmetic 
advantages  of  SDNR,  and  the  simpler  structure  of  binary  circuits. 

In  addition,  conversions  between  binary  and  SBNR  are  simple. 
For  instance,  two's  complement  numbers  are  converted  to  SBNR 
merely  by  changing  the  sign  of  the  most  significant  bit.  Numbers 
in  SBNR  are  recoded  into  binary  by  subtracting  the  negative  bits 
from  the  positive  in  a  conventional  binary  adder.  The  system 
should  be  designed  such  that  any  slowdown  caused  by  the  binary 
adder  is  offset  by  the  speedup  offered  by  using  SBNR  arithmetic. 
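The two conversions just described can be sketched as follows. This is an illustrative sketch only: the function names and the list-of-digits representation (most significant digit first, with -1 standing for the negative digit) are assumptions, and Python integer subtraction stands in for the conventional binary adder.

```python
def twos_complement_to_sbnr(bits):
    """Convert two's complement bits (MSB first) to SBNR digits.

    Only the sign of the most significant bit changes, because that bit
    carries weight -2**(n-1) in two's complement.
    """
    if not bits:
        raise ValueError("need at least one bit")
    return [-bits[0]] + list(bits[1:])


def sbnr_to_int(digits):
    """Recode SBNR digits by subtracting the negative part from the positive."""
    positive = 0
    negative = 0
    for d in digits:
        positive = 2 * positive + (1 if d == 1 else 0)
        negative = 2 * negative + (1 if d == -1 else 0)
    return positive - negative  # the single carry-propagating step


sbnr = twos_complement_to_sbnr([1, 0, 1, 1, 1])  # five-bit two's complement for -9
print(sbnr, sbnr_to_int(sbnr))
```

The subtraction in `sbnr_to_int` is the only place where a carry (borrow) chain appears, which is why the text asks that it be amortized over many SBNR operations.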

Table 1 shows the addition rules for SBNR, derived
from Avizienis. Six types of input bit patterns are identified.
Notice that in all cases, the carry out is independent of the
carry in. In most cases, the carry is independent of the
previous bits. In the rest, the carry depends only upon the
signs of the next-lower operand bits. If either of those bits is
negative (NEG = 1), the carry out is the lesser of the two operand
bits; otherwise, it is the greater.

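For example, to add two SBNR numbers, $1010_2 = 10_{10}$ and
$\bar{1}\bar{1}11_2 = -9_{10}$, first the adder generates the carries,
then uses the carry to find the final sum:

$$\begin{array}{lrl}
\text{operand 1} & 1\ 0\ 1\ 0 & = 10_{10} \\
\text{operand 2} & \bar{1}\ \bar{1}\ 1\ 1 & = -9_{10} \\
\text{carry} & 0\ 0\ 1\ 1\ 0 & \\
\text{sum} & 0\ 0\ 0\ 1\ \bar{1} & = 1_{10}
\end{array}$$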


Table 1. Truth table for SBNR addition.


When the input is (0,1) or (0,$\bar{1}$), however, the signs of the
next-lower  bits  must  be  known  to  generate  sum  and  carry.  Since 
the  sum  of  the  next-higher  bit  depends  on  the  carry,  and  the 
carry  depends  on  the  next-lower  bits  (and  not  on  the  next-lower 
carry),  one  can  see  that  signal  propagation  is  limited  to  two  bit 
places.  This  eliminates  the  delays  of  carry-ripple  adders  and 
the  interconnection  complexities  of  carry  look-ahead  adders. 
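The addition rules described above can be sketched in software. This is a minimal sketch under stated assumptions, not the circuit of the paper: the function names are hypothetical, digits are held in Python lists (most significant first, -1 for the negative digit), and the rule used for the mixed inputs is the one stated in the text (carry is the lesser operand digit when a next-lower digit is negative, otherwise the greater).

```python
def sbnr_value(digits):
    """Integer value of SBNR digits, most significant first."""
    value = 0
    for d in digits:
        value = 2 * value + d
    return value


def sbnr_add(a, b):
    """Carry-free addition of two equal-length SBNR digit lists.

    Returns n + 1 digits. Each carry depends only on the operand digits at
    its own position and the signs of the digits one position lower, so no
    carry chain longer than one position ever forms.
    """
    if len(a) != len(b) or not a:
        raise ValueError("operands must be non-empty and of equal length")
    n = len(a)
    x, y = a[::-1], b[::-1]  # index 0 is the least significant digit
    carry = [0] * n
    interim = [0] * n
    for i in range(n):  # every position could be evaluated simultaneously
        neg = i > 0 and (x[i - 1] < 0 or y[i - 1] < 0)
        total = x[i] + y[i]
        if total == 0:          # (0,0) or (1,-1): nothing to carry
            c = 0
        elif x[i] == y[i]:      # (1,1) or (-1,-1): carry equals the digit
            c = x[i]
        else:                   # (0,1) or (0,-1): consult the lower signs
            c = min(x[i], y[i]) if neg else max(x[i], y[i])
        carry[i] = c
        interim[i] = total - 2 * c
    result = [interim[i] + (carry[i - 1] if i > 0 else 0) for i in range(n)]
    result.append(carry[n - 1])
    return result[::-1]


total = sbnr_add([1, 0, 1, 0], [-1, -1, 1, 1])  # 10 + (-9)
print(total, sbnr_value(total))
```

The loop is written sequentially for clarity, but no iteration reads a value produced by another iteration, which is the software analogue of the two-digit signal propagation limit.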

MULTIPLE  VALUED  LOGIC 

The  last  decade  has  seen  a  growing  interest  in  the 
computational applications of multivalued logic (MVL). Much
of the motivation for this research stems from the reduction in
area  and  pinout  offered  by  MVL.  An  increase  in  the  number  of 
logic  levels  on  a  single  wire  causes  a  decrease  in  the  number  of 
wires.  This  decrease  reduces  both  the  interconnection  area  on  a 
chip  and  the  number  of  pins  required  to  transfer  data  to  the  rest 
of  the  MVL  system. 

Since there are three logic levels in SBNR ($1, 0, \bar{1}$), the best
MVL  system  to  use  should  have  three  states.  Those  technologies 
in  which  logic  levels  are  represented  by  current  (I2L,  ECL)  or 
charge  (CCD)  hold  the  greatest  promise  for  radices  higher  than 
four.  Voltage-mode  logics  (CMOS,  NMOS,  TTL)  seem  better  suited 
to  the  lower  radices. 


VLSI IMPLEMENTATION OF MVL

Only  the  low-power  families  (CMOS,  NMOS,  CCD)  will  allow  a 
great  enough  packing  density  for  VLSI.  Of  these  families,  CMOS 
circuits  have  received  the  most  attention.  Unfortunately,  it  is 
difficult  to  transfer  the  advantages  of  binary  CMOS--high  speed 
and  low  static  power  dissipation-- to  MVL  implementations.  Early 
designs  required  resistors  to  generate  the  center  voltage,  which 
slowed  down  the  gates  and  increased  the  power  dissipated. 
However, some of the more recent research has succeeded in
creating low-power all-transistor designs, at the cost of higher
gate complexity. The circuits operate on voltages of (+5, 0, -5)
volts to achieve logic levels ($1, 0, \bar{1}$).


These gates, and their truth tables, are summarized in
Figure 1. The circuit configurations of the ternary NAND, NOR
and inverter are all identical to their binary counterparts.
Their functional difference is a result of changing the
parameters of the transistors. The T-gate (transmission gate)
selects one of three inputs based on a control signal. The
S-gate (switch gate) consists of two pass-transistor pairs and an
inverter: when the control signal goes low, one input is passed
through its shorted pass transistors to the output; when the
control is high, the second set of pass transistors shorts to
allow the other input through. The X' and X" gates add one and
subtract one, respectively, mod 3.

These  gates  are  combined  to  form  a  ternary  adder  cell  which 
uses  SBNR  to  perform  the  function 

Ci"W 

The state signal $NEG_i$ tells the cell when one of the next-lower
digits is negative. It is the negative ternary NAND of $A_i$ and
$B_i$. Thus, when an operand is negative, $NEG_{i+1} = 1$; otherwise,
$NEG_{i+1} = \bar{1}$. This signal controls the sum and carry, as shown in
Table 1. The logic diagram of the arithmetic cell is pictured in
Figure 2.


Figure 2. Ternary adder cell


The implementation of the cell requires 112 transistors with
±5 V power supplies, and the longest signal path is 11 transistor
delays, t, long. The ternary gates used have been tested using
four micron design rules. In that case, each transistor delay
was about 1.5 nsec. If the gates could be implemented using a
gate width of one micron, each transistor delay would be about
0.1 nsec. Thus the total time from the appearance of the
operands to the appearance of the answer could be as low as 1.1
nsec (11 transistor delays at 0.1 nsec each). Since the addition
is totally parallel, a row of these cells operating simultaneously
could add two n-width words in the same 1.1 nsec as a single bit.

Conventional ripple-carry adders (length n) consist of n
full-adder cells connected in parallel. Carry from the addition
of the first bit moves to the second, where it is used to find
the carry to the third. The minimum cell for this operation has
a gate delay of two transistor delays. Thus, the total add
time is 2nt. Compared with the 11t of the SBNR adder, the SBNR
adder is a factor 0.18n faster than the ripple-carry adder, for n > 5.

Brent and Kung considered a scheme for implementing carry
lookahead adders in VLSI. In this paper, they demonstrated a
network that computed a length-n sum in $(4t)\log_2 n$ seconds. This
setup is faster than the carry-ripple circuit, but slower than
the SBNR adder by a factor of $0.36 \log_2 n$ for n > 6.
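The delay figures quoted above (11t for the SBNR cell, 2nt for ripple carry, and 4t log2 n for the Brent-Kung lookahead network) can be tabulated for a few word lengths. This is an illustrative sketch only; the function names are hypothetical and delays are counted in units of the transistor delay t.

```python
import math

SBNR_DELAYS = 11  # longest path through one SBNR cell, in transistor delays


def ripple_carry_delays(n):
    """Ripple-carry add time for n-bit operands: two delays per cell."""
    return 2 * n


def lookahead_delays(n):
    """Brent-Kung carry-lookahead add time: 4 * log2(n) delays."""
    return 4 * math.log2(n)


def speedup_table(word_lengths):
    """Rows of (n, sbnr, ripple, lookahead, ripple/sbnr, lookahead/sbnr)."""
    rows = []
    for n in word_lengths:
        ripple = ripple_carry_delays(n)
        cla = lookahead_delays(n)
        rows.append((n, SBNR_DELAYS, ripple, cla,
                     ripple / SBNR_DELAYS, cla / SBNR_DELAYS))
    return rows


for row in speedup_table([8, 16, 32, 64]):
    print("n=%3d  SBNR=%d  ripple=%d  lookahead=%.0f  ratios=%.2f, %.2f" % row)
```

The two ratio columns are 2n/11 and (4/11) log2 n, that is, the 0.18n and 0.36 log2 n factors of the text.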

The area complexities of both the SBNR adder and the ripple
carry adder are $O(n)$ (though the SBNR adder requires about three
times the area of the ripple carry adder). The area of the
carry-lookahead adder is $O(n \log_2 n)$. Thus, the area-time
complexities for the adders are:

SBNR: $O(n)$,

Ripple-carry: $O(n^2)$,

Carry-lookahead: $O(n \log^2 n)$.

The  lowest  of  these  is  the  SBNR,  making  it  the  logical  choice  for 
implementation in VLSI. Further study is needed to determine
whether  the  three-level  system  is  desirable.  A  binary-encoded 
SBNR  system  might  provide  the  same  area-time  advantages  as 
ternary  while  eliminating  the  negative  power  supply. 

SUMMARY AND CONCLUSIONS

Use  of  SBNR  to  eliminate  carry  propagation  has  resulted  in  a 
totally  parallel  adder.  The  adder  has  been  shown  to  be  faster 
than  conventional  binary  implementations.  A  comparison  of  the 
speed  increase  provided  is  given  in  Table  2.  In  high-precision 
systems,  use  of  SBNR  can  provide  a  significant  increase  in 
processor  speed  and  system  throughput. 
Table 2. Number of gate delays for various word lengths.

One disadvantage of SBNR is the added complexity of the
bit-level adder circuits. Another problem is conversion back to
binary. The usual method is to separate the positive digits from
the negative, and then add the two numbers in a conventional
adder. This brings back the old problem of carry propagation.
However, a new method proposed by Chen converts an SBNR number
to binary in a time proportional to the longest string of
consecutive zeros in the number. This method could make the SBNR
adder viable for a larger class of problems.

In a system that uses SBNR exclusively, of course, no
conversion need take place. In most systems, however,
custom-designing all components to operate under SBNR is too expensive.
Thus, SBNR should be used only in applications where the
conversion overhead is negligible compared to the computation
time saved.

Such an application is an adder tree for multioperand
addition. In this case, a large number of operands can be
added together in SBNR, converting only the final result to
binary. A similar application is parallel multiplication, with
parallel accumulation of partial products. In this case,
bit-level multiplications are performed on the operands, the partial
products are added in an SBNR adder tree, and the result is
converted using one carry lookahead adder.
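The adder tree described above can be sketched as follows. This is a hedged illustration, not the hardware of the paper: all names are hypothetical, the carry-free add follows the lesser/greater carry rule stated earlier in the text, and the final Python subtraction stands in for the single conventional adder used for conversion.

```python
def sbnr_add(a, b):
    """Carry-free SBNR addition; operands are zero-padded to equal length."""
    n = max(len(a), len(b))
    x = ([0] * (n - len(a)) + list(a))[::-1]  # index 0 = least significant
    y = ([0] * (n - len(b)) + list(b))[::-1]
    carry, interim = [], []
    for i in range(n):
        neg = i > 0 and min(x[i - 1], y[i - 1]) < 0
        total = x[i] + y[i]
        if total == 0:
            c = 0
        elif x[i] == y[i]:
            c = x[i]
        else:
            c = min(x[i], y[i]) if neg else max(x[i], y[i])
        carry.append(c)
        interim.append(total - 2 * c)
    digits = [interim[i] + (carry[i - 1] if i else 0) for i in range(n)]
    digits.append(carry[-1])
    return digits[::-1]


def adder_tree(operands):
    """Sum non-empty SBNR operands pairwise; returns (digits, tree depth)."""
    if not operands:
        raise ValueError("need at least one operand")
    level = [list(op) for op in operands]
    depth = 0
    while len(level) > 1:
        nxt = [sbnr_add(level[i], level[i + 1])
               for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:          # odd operand passes through unchanged
            nxt.append(level[-1])
        level = nxt
        depth += 1
    return level[0], depth


def sbnr_to_int(digits):
    """Single final conversion: positive part minus negative part."""
    pos = sum(2 ** k for k, d in enumerate(reversed(digits)) if d == 1)
    neg = sum(2 ** k for k, d in enumerate(reversed(digits)) if d == -1)
    return pos - neg


operands = [[1, 0, 1, 0], [-1, -1, 1, 1], [0, 1, 1, 1], [1, -1, 0, 1],
            [0, 0, -1, 1], [1, 1, 1, 1], [-1, 0, 0, 0], [0, 1, 0, -1]]
digits, depth = adder_tree(operands)
print(digits, depth, sbnr_to_int(digits))
```

Each tree level costs one constant-time SBNR addition, so eight operands need three levels, and only the last step pays for carry propagation.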
Processor capability for hardware implementation of Kalman filters*


A  Kalman  filter  is  a  set  of  nxm 
matrices  and  n  vectors  that  compute 
equations  similar  to  those  in  the  first  six 
equations  below. 


$$\hat{x}(t+1) = A\hat{x}(t) + K(t)\,[\,y(t) - C\hat{x}(t)\,] + Bu(t), \qquad \hat{x}(0) = x_0$$


where K(t) is the filter gain given by


$\Sigma(t)$ is the state error covariance, that
is,

$$\Sigma(t) = E\{[x(t) - \hat{x}(t)][x(t) - \hat{x}(t)]^T \mid y(t-1), y(t-2), \ldots, y(t_0)\}$$


$\Sigma(t)$ satisfies the following Riccati
difference equation:

$$\Sigma(t+1) = A\Sigma(t)A^T + Q - K(t)\,[\,C\Sigma(t)C^T + R\,]\,K(t)^T, \qquad \Sigma(0) = \Sigma_0$$
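To make the mult/acc load of one filter update concrete, the recursion can be sketched for the scalar case (n = m = 1). This is an assumption-laden sketch: the function name is hypothetical, and the gain expression used here, K = A S C / (C S C + R), is the standard one-step predictor form, assumed because the gain equation itself is not legible in the text.

```python
def kalman_step(x_hat, sigma, y, u, A, B, C, Q, R):
    """One scalar update of the state estimate and error covariance."""
    innovation_var = C * sigma * C + R
    K = A * sigma * C / innovation_var           # assumed predictor-form gain
    x_next = A * x_hat + K * (y - C * x_hat) + B * u
    sigma_next = A * sigma * A + Q - K * innovation_var * K  # Riccati step
    return x_next, sigma_next, K


x_hat, sigma = 0.0, 1.0
for y in [2.0, 2.0, 2.0]:
    x_hat, sigma, K = kalman_step(x_hat, sigma, y, u=0.0,
                                  A=1.0, B=0.0, C=1.0, Q=0.0, R=1.0)
    print(round(x_hat, 4), round(sigma, 4), round(K, 4))
```

In the matrix case every product above becomes a matrix multiplication, which is where the multiply/accumulate speed of each processor dominates.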


Several published papers imply that
these equations can be computed in
n log n time using array processors (refs. 1, 2, 3, 4).
Such works present only a partial picture.
The larger question is, “How many
processors are needed?” The most
serious concern is, “What is the complexity
of each processor?”


A reasonable figure-of-merit (FOM)
should be

$$\text{FOM} = (\text{number of computational steps}) \times (\text{number of processors}) \times (\text{processor complexity})$$

This treatise only addresses the processor
complexity issue, since other papers have
sufficiently studied the remaining two
factors. Claims made for an O(n log n)
bound to the number of computational
steps assume a large array of processing
elements, or PEs. In fact, n x m PEs are
required as a minimum. This reputed
bound does not take into account the attendant
communications and control circuitry
of the array processor.

The complexity of these O(n log n) type
PEs is equivalent to that of a 68000 microprocessor,
which has 70K transistors,
multiplies (16 x 16) in 6.75 microseconds
and adds in 1 microsecond. No
conventional processor has several 68000
chips on a single substrate; such devices
are optimistically forecasted for 1995. At
present, only one commercial array
processor chip is available (NCR's
GAPP, #NCR45CG72). This state-of-the-art
array is a 6 x 12 set of PEs with
one-bit (not eight or 16) ALUs. As an
algorithm designer for this chip, this
writer has found no software tools
available, no test circuits, and no
emulators. A study of support tools
(software and hardware) for any array
processor and a thorough comparison of
SIMD (single instruction multiple data),
MIMD, array processors and data flow
machines (which are SIMD-like) are
desperately needed.

The brief comparison herein shows
that the current multiprocessor architectures
(e.g., the WSMR-DCM, ref. 5) are still
superior. This is primarily because hardware
mult/div/add cycles are 400 times
faster than any available or conjectured
array processor PE. The DCM performs
a 16 x 16 multiplication in 220 nsec
(compared to 6.75 microseconds required
by the equivalent PEs necessary
to do Kalman filtering). Furthermore, a
support environment with hardware/
software already exists. An n x m Kalman
filter would be executable in approximately
$n^2 \log n$ steps in an architecture
such as the DCM. Assuming, then, that
the processor complexity is equivalent,
the FOM for each approach (systolic array
versus DCM) is

$$\text{FOM}_{\text{DCM}} = (n^2 \log n)(1)(1)$$

while the systolic approach reported by
Kailath, Kung, and others is

$$\text{FOM}_{\text{array}} = (n \log n)(n^2)(1)$$

$$\text{FOM}_{\text{array}} = n \times \text{FOM}_{\text{DCM}}$$


Since no software environment exists for
any array processor, the DCM-like architectures
remain clearly superior.
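The figure-of-merit comparison above can be evaluated numerically. This is a sketch under stated assumptions: the function names are hypothetical, the logarithm base (2) is an assumption since the text does not specify one, and processor complexity is taken as 1 for both approaches, as in the text.

```python
import math


def fom(steps, processors, complexity):
    """Figure of merit: steps x processors x processor complexity."""
    return steps * processors * complexity


def fom_dcm(n):
    """Single DCM-like multiprocessor: about n^2 log n steps, one processor."""
    return fom(n ** 2 * math.log2(n), 1, 1)


def fom_array(n):
    """Systolic array: about n log n steps on n^2 processing elements."""
    return fom(n * math.log2(n), n ** 2, 1)


for n in (4, 16, 64):
    print(n, fom_dcm(n), fom_array(n), fom_array(n) / fom_dcm(n))
```

The printed ratio equals n for every size, which is the "n times" relation stated above; a lower FOM is the better one under this measure.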
Concurrency and Parallelism - Future of Computing

M. Andrews and J. S. Walicki *


Abstract: This paper discusses parallelism and concurrency
in the light of current computing practices.
A special case of SIMD machines, also known as
systolic arrays, is analysed. A new architectural
engine, the GAPP systolic array, is studied in the
application domains of signal and image processing.
Also included are database and associative processing
cases. Some interesting conclusions can be
drawn from a PE which can also be viewed as an
intelligent memory.

1  Introduction 

The computer age is very short but rather turbulent. No other
field of technology has been changing so dramatically. Increases
in speed and efficiency of computing are unprecedented. Computers
of the first two generations of computer evolution (1940's
to early 1960's) were primarily used for data processing. Computing
strategies applied to that type of processing were simply
extensions of traditional human processing of data. That is, the
processing was sequential in nature, but relatively fast due to
increases in speed of execution. This human-oriented model of
computing was clearly reflected in Von Neumann's definition
of a digital computer.

However, data intensive applications, like weather
analysis/prediction and various aspects of signal processing, pushed
the classical computer architecture to speed limits imposed by
an existing technology.
Therefore, it is not surprising that the future of computing
lies in increased exploitation of concurrency in computing tasks.
Concurrent computation implies these, not necessarily disjoint,
classes of activities:

parallel: occurring in different resources in the same time interval.

simultaneous: taking place at the same moment.

pipelined: activated in overlapped time frames.

The highest level of parallel activities takes place among
multiple tasks/programs. This level requires the development
of parallel processable algorithms which depends on the efficient
allocation of limited resources to individual tasks. The
next level of parallelism may occur among procedures or program
segments within the same program which requires decomposition
of a program into multiple tasks. The next lower
level exploits concurrency among multiple instructions. Such
a concurrency is revealed by data dependency analysis. Additionally,
vectorization of sufficiently large scalar operations
can be performed. These three levels of parallelism are most
often dealt with in software. The last and the lowest level of
parallelism is concerned with concurrency of operations within
each instruction, and it is usually implemented in hardware.
We can identify several kinds of parallelism even in single processor
systems. Parallelism can be exploited, or induced, by
using multiple functional units. For example, CDC-6600
has 10 functional units performing arithmetic and logic operations.
The units are independent and can operate in parallel
(Baer80, Hwan84). Parallelism within the CPU is
achieved, in both large processors and modern microprocessors,
by overlapping (pipelining) the fetch, decode and execute
phases. Finally, overlapping of CPU and I/O operations
can be performed by using separate I/O controllers.

In this paper we want to examine an SIMD architecture
which holds promise of a new threshold of computer architectures
which will impact the marketplace for some time. The
architecture is configured about a VLSI primitive cell of 72 processing
elements regularly organised. The term 'primitive cell'
is a misnomer since the PE contains 72 individual ALU's. We
conjecture that this VLSI cell is the forerunner of numerous
offspring with even greater computational power.
Before we can place this novel device in the spectrum of modern
processors it is necessary to present existing taxonomies of
digital processors.


1.1 Parallel architectures


The basic architectural classes of parallel machines are:

• pipeline computers

• array processors

• multiprocessor systems

A pipeline computer performs overlapped computations to exploit
temporal parallelism. An array processor achieves
spatial parallelism by using synchronized arithmetic logic
units. A multiprocessor system achieves asynchronous parallelism
in a set of interacting processors with shared resources
(Hwan84). The above classification is by no means perfect.
New developments in the area of systolic systems blur distinction
among all three types of parallel systems.

Different taxonomies arise depending on the primary feature
chosen to distinguish among different architectures. Three
major classifications are those of Flynn (Flyn66), Feng (Feng77)
and Handler (Hand77).

1.1.1  Flynn's  classification 

This historically first classification is based on the interaction
of instruction and data streams. The stream is a sequence of
items (data or instructions) operated upon by a single processor.
This leads to classes of structures determined by the
multiplicity of the functional units devoted to servicing the
instruction and data streams. Flynn identifies the following four
types of machines:


The MISD organisation has not been implemented in practice
and it is included in this classification for the sake of completeness.
In the MISD concept n processors, controlled by
distinct instruction streams, operate on the same data stream.
The output of one processor becomes the input to the next
processor, and therefore, this scheme can be efficiently realised
in the SISD structure.

Finally, the MIMD organisation is the most challenging
and the most promising in terms of speed and efficiency. The
proper MIMD architecture consists of closely interacting processors
which operate on the data streams derived from the
same data space shared by all processors. It is possible to envision
the system in which processors operate on independent
data sets. This type of organisation is called multiple SISD
since it is a set of independent SISD machines. Examples of
closely coupled machines of this kind are: C.mmp, Cray-2 and
Cray XMP, Denelcor HEP, Burroughs D-825.

1.1.2  Feng's  classification 

The classification proposed by Feng is based on the degree of
parallelism. If $P_i$ is the number of bits processed within the $i$-th
cycle of total $T$ processor cycles, then the average parallelism
degree is

$$P_a = \frac{1}{T} \sum_{i=1}^{T} P_i$$

Since, in general, an average parallelism degree is less than the
hypothetical maximum parallelism degree $P$, the utilization rate
$\mu$ can be defined as

$$\mu = \frac{P_a}{P} = \frac{\sum_{i=1}^{T} P_i}{T \cdot P}$$
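The average parallelism degree and the utilization rate defined above amount to a mean and a ratio. The following is a minimal sketch; the function names and the sample bit counts are hypothetical, chosen only to illustrate the arithmetic.

```python
def average_parallelism(bits_per_cycle):
    """Mean number of bits processed per cycle over T observed cycles."""
    if not bits_per_cycle:
        raise ValueError("need at least one cycle")
    return sum(bits_per_cycle) / len(bits_per_cycle)


def utilization_rate(bits_per_cycle, max_parallelism):
    """Average parallelism divided by the maximum parallelism degree P."""
    return average_parallelism(bits_per_cycle) / max_parallelism


# Hypothetical trace: a machine able to process 8 bits per cycle (P = 8)
# that processed 8, 4, 4 and 0 bits in four consecutive cycles.
trace = [8, 4, 4, 0]
print(average_parallelism(trace), utilization_rate(trace, 8))
```

A utilization rate of 1 would mean every cycle used the full word length and bit-slice length of the machine.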
SISD single instruction stream-single data stream
SIMD  single  instruction  stream-multiple  data  stream 
MISD  multiple  instruction  stream-single  data  stream 
MIMD  multiple  instruction  stream-multiple  data  stream 


SISD architecture is the simplest, composed of a single
processing unit (PE), a control unit (CU) and a memory module
(MM). This is also the structure of the large percentage
of the existing machines. Many of the uniprocessors (SISD
machines) are pipelined, that is, instructions are overlapped
in their sequential execution. Exemplary machines with one
functional unit are: IBM 7090, PDP VAX 11/780. Examples
of existing SISD machines with multiple functional units are:
IBM 360/91, IBM 370/168UP, CDC 6600, CDC Star-100, FPS
AP-120B, Cray-1, CDC Cyber-205.


SIMD machines have multiple processing elements governed
by the same control unit. The processors respond to the
same instruction stream but usually operate on different data
subsets. The memory is shared and usually contains several
memory modules. This class of machines is exemplified by array
processors and systolic arrays. The following machines belong
to this class: Burroughs' Illiac-IV and BSP, Staran, MPP
(massively parallel processor by Goodyear Aerospace).


The maximum degree of parallelism P is equal to the product
of the word length w and the bit-slice length m. A size
of the bit-slice is determined by the number of bits that can
be processed by a system in the same instance. For example,
a processing unit with two four-stage pipelines yields a
bit-slice of 8. In Feng's taxonomy the relationship between the
wordlength and size of the bit-slice determines a machine class.
Machines like Staran and MPP have a short wordlength and
very long bit-slices. On the other side of the spectrum are
machines with a relatively long wordlength and short bit-slice.
IBM 370/168, PDP-11 and Cray-1 belong to this class. According
to Feng's classification particular mixtures of the bit-slice
size and the wordlength give rise to four classes of processing
methods, which are listed below.


WSBS  word-serial,  bit-serial 
WPBS  word-parallel,  bit-serial 
WSBP word-serial, bit-parallel
WPBP  word-parallel,  bit-parallel 


The first category (WSBS) includes the first generation computers
with bit-serial arithmetic. Most contemporary machines
are of the WSBP kind, also called word-slice computers. WPBP
signifies the fastest, fully parallel processing in which whole blocks
of bits are processed at a time.
1.1.3 Handler's classification

This classification is based on the degree of parallelism and
pipelining present in the hardware of a computer. Three hardware
subsystems are considered:

• control unit (CU)

• arithmetic logic unit (ALU)

• bit-level circuit (BLC - serial combinational circuitry in
ALU)

A computer C can be assigned the figure of merit T(C):

$$T(C) = (K \times K',\ D \times D',\ W \times W')$$

where

K = the number of CPUs

D = the number of ALUs (or PEs) under control of a CU

W = the word length of an ALU

W' = the number of pipeline stages in all ALUs

D' = the number of ALUs that can be pipelined (chaining)

K' = the number of CUs that can be pipelined (macropipelining)

The Handler's taxonomy can be explained on the example
of the C.mmp multiprocessor system developed at Carnegie-Mellon
University. The C.mmp consists of 16 PDP-11 minicomputers,
shared memory modules and a 16 by 16 crossbar interconnection
network. The system is unique because it can
operate in various configurations. The normal mode of operation
is the MIMD. However, under control of a synchronizing
unit it can operate in the SIMD mode, and if all processors
are cascaded so that they operate on the same data stream
the MISD mode results. All three operating modes can be
characterized in the following way:

$$T(\text{C.mmp}) = (16, 1, 16) + (1 \times 16, 1, 16) + (1, 16, 16)$$


3 Geometric Regularity and Systolic Systems

Regularity of computer structures is appealing not only for aesthetic
reasons, but it also has distinct functional advantages.
This fact was recognised early in the computer era when the
concept of cellular interconnection arrays was developed for the
purpose of data switching. Processing elements of these arrays
were relatively primitive and contained on the order of ten elementary
gates. Such limited processing capabilities were not
overly restrictive as long as the main goal was switching of data.
Developments in the field of VLSI and increased interest in networking
resulted in the new category of systems which combine
concepts of parallelism, pipelining and interconnection. These
are so-called systolic systems in which processors compute
and transmit data in the synchronized fashion. Systolic systems
are mature cellular systems in which elementary processing
units (PE's) are relatively complex. The PE's are usually
placed in the nodes of the simple one- or two-dimensional grid.
Tight coupling and pipelining ability of the systolic systems result
in constant-time processing. A more rigorous description
of a systolic system follows (Leis83).

A systolic system is a synchronous network of processors.
Each processor is composed of a constant number of Moore machines
(state-output FSMs) which are defined by the quintuple
$(Q, I, O, \delta, \rho)$ where

Q - set of internal states

I - set of input symbols

O - set of output symbols

$\delta$ - state transition function

$\rho$ - output function: $o(t + 1) = \rho(q(t + 1))$, with $q(t + 1) \in Q$ the state at step $t + 1$

In pure systolic systems only Moore machines are allowed. Inclusion
of Mealy machines produces semi-systolic systems. If
Mealy machines were included and connected together, then
logic signals could ripple through several machines in one clock
period. The exclusion of Mealy machines is important because
it guarantees that the clock period does not grow with system
size. Thus a clock period becomes a measure of time which is
independent of system size. The structure of a systolic system
S is given by a machine graph G = (V, E), where V is a set of
Moore machines and E is a set of edges linking the machines.
The neighborhood of a machine $v \in V$ is the set of machines
with which it communicates:

$$\text{Neigh}(v) = \{\, w \mid (v, w) \in E \ \text{or}\ (w, v) \in E \,\}$$

For S to be systolic, it is required that the Moore machines be
small in the sense that $Q_v$, $I_v$, $O_v$ and Neigh(v) are bounded,
which implies that the system S may grow, but individual processors
may not. It is important to preserve inequality between
a propagation delay between processors and a processor delay.
For that reason the systolic systems with only nearest neighbor
connections are especially attractive since a propagation
delay is insignificant. Therefore, the global communication in
systolic systems should be avoided as it imposes difficult timing
restrictions and contributes to additional circuit complexity
and area.
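The role of the Moore restriction can be illustrated with a toy linear array. This is a hypothetical sketch, not taken from the cited description: each cell's state is a single latched value, its output function is the identity on that state, and its transition function simply stores the value presented by its left neighbour.

```python
def moore_step(states, boundary_input):
    """Advance a linear array of Moore cells by one clock period.

    Outputs depend only on the latched states, so a value can move at
    most one cell per clock, whatever the length of the array.
    """
    outputs = list(states)                      # output = function of state only
    inputs = [boundary_input] + outputs[:-1]    # nearest-neighbour wiring
    return inputs                               # next state = current input


def run(states, stream):
    """Feed a stream of boundary inputs; return the state after each tick."""
    history = []
    for sample in stream:
        states = moore_step(states, sample)
        history.append(list(states))
    return history


history = run([0, 0, 0], [7, 0, 0, 0])
for tick, snapshot in enumerate(history, start=1):
    print(tick, snapshot)
```

The value 7 needs three clock periods to reach the third cell. With Mealy cells, whose outputs depend on their current inputs, it could ripple through all three within one period, so the clock period would have to grow with the array.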

On the other hand, the global communication is convenient
because it provides an efficient way for initialising PEs (broadcasting)
and gathering status information. Fortunately, the
systolic conversion lemma due to Leiserson allows one to design
semi-systolic systems (with broadcasting) and implement
equivalent systolic structures.

3.1 Application of Systolic Systems

Computational tasks usually are compute-bound or I/O-bound.
In general, if the total number of operations is larger
than the total number of input and output elements, then the
computation is compute-bound. The systolic systems solve efficiently
compute-bound problems. A large spectrum of problems
has been attacked using the systolic approach. The following
is a brief list of major types of application:

Matrix arithmetic:

• matrix-vector multiplication

• matrix-matrix multiplication

• matrix triangularization (solution of linear systems)

• QR decomposition (eigenvalues, least-squares computations)

• solution of triangular linear systems

Non-numeric applications:

• data structures (stacks, queues, priority queues, searching
and sorting)

• graph algorithms

• language processing (string matching, regular expressions)

• dynamic programming

• encoders (polynomial division)

• relational database operations
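
The compute-bound criterion stated at the start of this section can be
checked with simple operation counts. The following is only a minimal
sketch: the function name and the approximate operation counts for the
two n x n examples are assumptions made for illustration and do not come
from the text.

```python
def classify(operations, io_elements):
    """Rule of thumb: more operations than I/O elements means compute-bound."""
    return "compute-bound" if operations > io_elements else "I/O-bound"

n = 64
# n x n matrix-matrix multiplication: roughly 2*n**3 multiplies and adds
# for 2*n**2 input elements and n**2 output elements.
matmul = classify(2 * n**3, 3 * n**2)
# n x n matrix addition: n**2 additions for the same 3*n**2 I/O elements.
matadd = classify(n**2, 3 * n**2)
print(matmul, matadd)
```

Under these counts matrix multiplication falls on the compute-bound side
and matrix addition on the I/O-bound side, which is why the former
appears in the list above.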

Finally, an enormously important class of real-time applications
contains primarily digital signal processing tasks. It is
worth noting that the systolic systems allow real-time (or near
real-time) implementations of powerful signal processing algorithms
(LMS and Kalman adaptive filtering). Some of the signal processing
applications are (Urqu84, Canute, Ward!*, SpeiM):

• FIR, IIR filtering

• 1D and 2D FFT

• 1D and 2D convolution and correlation

• median filtering

• adaptive filtering

The next section presents the existing 72-processor systolic
array and discusses practical aspects of its utilization.

3 GAPP - bit-level systolic array

The GAPP - Geometric Arithmetic Parallel Processor (NCR45CG72)
may be considered to be the forerunner of many new systolic
processing elements. However, it is the first device to recognize
the value of bit-level implementations. Any systolic array, in
order to be effective, should have

• complete and very regular PE's

• as many PE's on a chip as possible

• cell arithmetic functions which can be performed in one
cycle

• nearest neighbor communications only

The GAPP is just such a chip. Its cell features are illustrated
in Fig. 1. Here, we see 4 PE's with nearest neighbor
coupling and a single global broadcast line. These lines reduce
the VLSI interconnect space to a realistic amount.

Of particular note is the ability of the GAPP device to be
cascaded. That alone makes the GAPP significantly powerful.
Several GAPP applications have been reported which cascade
multiple GAPP devices in image, signal, and data processing
tasks. Furthermore, each PE has an autonomous 128-bit
RAM. If cascaded properly, a GAPP array can not only process,
but also perform frame buffering, possibly at video rates.
This is important to many image processing applications where
frame buffering is required. In that case, the cascaded GAPPs
store as well as process, both simultaneously. In practice, a
designer should view a GAPP array as intelligent memory. This
then opens the design space to even larger opportunities.

Each processing element in the GAPP array consists of a
bit-serial ALU, 128 x 1 individually addressable RAM and 4
single-bit latches. The I/O latch allows communication through
the PE without interrupting the ALU, and the remaining latches
hold inputs to the ALU. The GAPP operates as a SIMD machine,
that is, instructions are broadcast to each cell from an
external control store, loaded in turn from the host computer.
Proper address sequencing can be provided by any general address
sequencer. The instructions directed to the processing
elements consist of a 13-bit control field which specifies the
array connectivity and arithmetic/logic operations, and a 7-bit
RAM address. These instructions can be sent to the GAPP
array at the rate of 10 MIPS. The whole array has a global
broadcast of input data and a global output. One GAPP array
device can function as a modular component in a larger
array, thus enabling word or bit-length growth of an array as
needed. Computations are performed in bit-serial arithmetic.
All primitive arithmetic operations execute in a single processor
cycle. Each processor accepts data from RAM, from each
of its four nearest neighbors, or from constant data provided by
the instruction. A carry latch allows the extension of bit-serial
computation to user-defined fixed length words (Gapp84a).
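
The role of the carry latch in extending one-bit operations to
multi-bit words can be illustrated with a small software model. The
following is a hedged sketch only: the helper names, the RAM layout
(words stored LSB first at consecutive addresses) and the one-bit-per-
cycle loop are assumptions made for illustration, not the GAPP
instruction set.

```python
RAM_BITS = 128  # each PE owns a 128 x 1 bit RAM

def store_word(ram, base, value, width):
    """Write `value` LSB first into RAM bits base .. base+width-1."""
    for i in range(width):
        ram[base + i] = (value >> i) & 1

def load_word(ram, base, width):
    """Read back a word stored LSB first."""
    return sum(ram[base + i] << i for i in range(width))

def bit_serial_add(ram, a_base, b_base, sum_base, width):
    """Add two `width`-bit words one bit per cycle, keeping the carry in a latch."""
    carry = 0  # carry latch, cleared before the first cycle
    for i in range(width):
        a, b = ram[a_base + i], ram[b_base + i]
        ram[sum_base + i] = a ^ b ^ carry        # full-adder sum bit
        carry = (a & b) | (carry & (a ^ b))      # full-adder carry bit
    return carry                                 # carry out of the top bit

ram = [0] * RAM_BITS
store_word(ram, 0, 100, 8)
store_word(ram, 8, 55, 8)
carry_out = bit_serial_add(ram, 0, 8, 16, 8)
result = load_word(ram, 16, 8)
```

Because only the single carry bit persists between cycles, the word
length is set by the loop count rather than by the hardware, which is
the property the paragraph above attributes to the carry latch.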

From the point of view of the interconnection network taxonomy
the GAPP device can be described as having the synchronous
operation mode, centralized control strategy, circuit
switching methodology, and regular and static network topology.
In Flynn's classification it is a SIMD device. It can be classified
as a word-parallel, bit-serial (WPBS) machine according
to Feng's taxonomy, and the Handler's figure of merit is

T(GAPP) = (12 x 1, 12 x 12, 1 x 1)

3.1 Applications of GAPP Processor

3.1.1 Adaptive Filtering

Classical techniques aimed at increasing a signal-to-noise ratio
(SNR) usually employ information derived from a single signal
sensor. No additional information is provided, other than
the signal and perhaps statistical properties of both signal and
noise. However, chances for the restoration of the original signal
can be increased if multiple measurements of the signal
wave are available. An array of sensors provides such an
opportunity. The array generates parallel streams of data and
the information in this form can be processed using the GAPP
arrays. Processing of signals arriving from the sensor array
can be accomplished most efficiently if the processing method is
adaptive in nature. The least mean square (LMS) algorithm is
such a method. It is possible to partition the LMS adaptation
rule into the basic operations suitable for the processor array
implementation (AndrSSc).

Operation 1: weight input vector

Operation 2: sum partial products to get y(k)

Operation 3: compute error e(k)

Operation 4: scale error 2k_s e(k) = C

Operation 5: update weights w <- w + Cw

It can be seen from the above list of operations that they
can be grouped into two distinct phases - compute and update.
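
The five operations listed above can be sketched in software as one
iteration of an LMS filter. This is a minimal illustration under stated
assumptions, not the processor-array implementation: the function names,
the floating-point arithmetic, and the restriction of the step size to a
power of two (so that scaling is a shift) are choices made here, and
Operation 5 is written in the standard Widrow-Hoff form, in which the
scaled error multiplies the input vector x(k).

```python
import random

def lms_step(w, x, d, shift):
    """One LMS iteration. The step size is 2**(-shift), so the scaling in
    Operation 4 corresponds to a right shift in fixed-point hardware."""
    products = [wi * xi for wi, xi in zip(w, x)]    # Operation 1: weight input vector
    y = sum(products)                               # Operation 2: sum partial products
    e = d - y                                       # Operation 3: compute error
    c = e / (1 << shift)                            # Operation 4: scale error
    w_new = [wi + c * xi for wi, xi in zip(w, x)]   # Operation 5: update (standard form, assumed)
    return w_new, y, e

def identify(true_w, n_samples=2000, shift=3, seed=0):
    """Adapt w so that the filter output tracks a reference filter `true_w`."""
    rng = random.Random(seed)
    w = [0.0] * len(true_w)
    x = [0.0] * len(true_w)
    for _ in range(n_samples):
        x = [rng.uniform(-1.0, 1.0)] + x[:-1]       # shift in a new input sample
        d = sum(t * xi for t, xi in zip(true_w, x)) # reference sample d(k)
        w, _, _ = lms_step(w, x, d, shift)
    return w

w_est = identify([0.5, -0.25])
```

Operations 1 to 3 form the compute phase and operations 4 and 5 the
update phase, matching the grouping noted above.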

The basic processing steps listed above require a variety
of elementary matrix operations. Operation 1 requires
multiplication of two vectors. Operation 5 calls for multiplication
of a vector by a scalar and for addition of two vectors.
Operations 2, 3 and 4 require operations on scalars. The values
to be operated on are represented by k-bit words. Existing
processor (systolic) arrays possess a serial architecture, mainly
because it is still prohibitively expensive to build a fully parallel
single-wafer multi-processor array. Therefore, the operations
mentioned above have to be performed at the bit level.

Efficient use of the processor array is important in order to
make up for losses caused by the use of serial arithmetic. As
Urquhart and Wood (Urqu84) show, the array utilization
depends on properly feeding the elements to be processed.
Particularly, if one matrix of arguments is kept static on the
processor array and the other matrix is entered properly skewed,
then the array is 100% efficient. In our case it is quite natural
to keep the coefficients w fixed in the array, while bit-streams of
input samples x march in from the array sensors. The arrangement
described above is a basis for the implementation of the
first operation, namely formation of the inner product w^T x. In
the next step partial products are summed yielding an output
sample y. In the third step the error sample is generated by
subtracting (in a single adder) the y stream from the d stream.
The d stream is a serialized sample of a reference signal. The
obtained value of the error is scaled in the next operation (4).
This is accomplished easily by properly shifting e(k). The last
step involves multiplying a vector by a scalar and addition of
vectors (bit matrices). The first operation in this step is the
multiplication of the weight vector w by the scaled error e(k).
Since all weights are scaled by the same scalar value it amounts
to shifting original weights by a number of places determined
by the value of 2k_s e(k). The last step requires adding the old
weights to the scaled values obtained in the previous operation.
These steps complete the update phase and a new input sample
is processed as in the first step.

If the least significant bits of x and w interact first then
partial products of equal significance leave the edge of the array
at the same time. These partial products can be accumulated
using the adder tree constructed from MSI adders. If, however,
the least significant bit of x interacts with the most significant
bit of w then the partial products of equal significance appear
skewed and then the linear chain of full adders can be used
to accumulate the final output. If the second option is chosen
then a single row of PE's of a second processing array can be
used to accumulate partial products.
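
The notion of partial products of equal significance can be made
concrete with a short sketch. It is an illustration only: the function
names are hypothetical, and the code groups one-bit products by weight
without modelling the timing or skew of the array.

```python
def partial_products(x, w, k):
    """Return the k*k one-bit partial products of two k-bit words,
    grouped into columns by significance i + j."""
    columns = [[] for _ in range(2 * k - 1)]
    for i in range(k):          # bit i of x
        for j in range(k):      # bit j of w
            columns[i + j].append(((x >> i) & 1) & ((w >> j) & 1))
    return columns

def accumulate(columns):
    """Sum every column with its weight 2**s, as an adder tree or a
    chain of full adders would."""
    return sum(sum(col) << s for s, col in enumerate(columns))

product = accumulate(partial_products(13, 11, 4))
```

Whether the bits in one column arrive together or skewed in time is what
distinguishes the adder-tree and adder-chain options described above;
the arithmetic result is the same in both cases.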

The second array implements operations 2, 3 and 4, that is
the computations of the output sample y(k) and the scaled error
2k_s e(k). These partial products are accumulated in the elements
of the first row. Because of the additional skewing of the
partial products leaving the first array, a delay is needed between
the accumulating PE's. In the GAPP device this can be
easily accomplished by storing the elementary sums in the local
RAM locations. An error value e(k) is computed by subtracting
the output sample from the reference sample d(k) which
is shifted into the fourth row of the array simultaneously with
the loading operation of the third row. The result of this operation
remains in the fourth row and is shifted according to the
value of 2k_s. The quantity obtained in this step is the scaled
error value and it is sent to the host controller. The controller
uses this value to scale (shift) the original coefficients residing
in the main array. However, before this operation is performed
the original weights must be saved in the RAM locations. The
scaled weights are also uploaded into the local RAM, and then
both quantities are summed (w <- w + 2k_s w) yielding the new
updated coefficients, which are used to compute a new output
sample and the process is repeated.

3.1.2 Hardware Database Machine

A  database  machine  should  have  the  ability  to: 

• support simple and complex queries

• provide JOIN and SEMI-JOIN macros

• order data rapidly

• invoke fixed and variable length record format

If the GAPP is cascaded as a set of basic building blocks as
shown in Fig. 2, the aggregate system forms an efficient and
regular database machine. Many of the normal software operations
are handled directly in hardware. The comparator block
accepts an input comparand or Record B as 6-bit wide words
on the CMS lines. Input record A is entered on the S (south)
lines. The comparand is stored in EW registers. Any time
the data stream matches the comparand the GO (global output)
line goes high. At 5 million characters per second, this
exact match operation searches text files for specific words
very rapidly.
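
The exact match operation described above can be modelled in a few
lines. This is a behavioral sketch under assumptions: the function name
is hypothetical, characters stand in for the 6-bit words, and the
sliding window stands in for the comparand held in the array while the
record streams past.

```python
def go_line(stream, comparand):
    """Return one GO bit per input character: 1 when the most recent
    len(comparand) characters of the stream equal the comparand."""
    n = len(comparand)
    window = ""
    go = []
    for ch in stream:
        window = (window + ch)[-n:]     # keep only the last n characters
        go.append(1 if window == comparand else 0)
    return go

signal = go_line("the gapp array maps gapp cells", "gapp")
hits = [i for i, bit in enumerate(signal) if bit]
```

Each index in `hits` marks the stream position at which the GO line
would go high, that is, the last character of a matching word.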

3.1.3  GAPP  as  Associative  Device 

An associative memory operates on the basis of matching
the contents, not the address, of the information. The associative
search can be accomplished efficiently if all memory cells
are checked for the desired information in parallel. The associative
(or content-addressable) memory can be built from GAPP
devices (Wall84).

The associative memory unit consists of an associative array
and an associative array controller unit. The associative array
is composed of cells, each with a tag bit. If the tag of a cell is
set then the cell is a responder. Each cell holds one entry. The
cells perform specific functions - compare, read and write.
The control unit has two important registers - comparand
and mask registers.

The GAPP devices combined with an appropriate control
structure form a complete associative array. The associative
array consists of a cascaded array of GAPP devices. Each element
of GAPP is utilized as a bit-serial associative cell. Since
each element is serial in nature, multibit words are emulated
by using the local 128-bit RAM which also serves as storage
for intermediate operations. The NS (north-south) register
performs various data functions and stores the tag bit. The
tags are combined and form the Global Output signal providing
a responder signal to the control unit. I/O functions are
performed through the CM bus. Input data is usually loaded
as word-serial, bit-parallel but operation of the array is
word-parallel, bit-serial; therefore, the corner turning operation on
data must be performed, which could be done using GAPP devices
or specialized circuitry. Global input to every cell in the
array is easily accomplished by using the op code lines to
command the C register to load 1/0.
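
Corner turning amounts to transposing a bit matrix. The sketch below is
a hedged software illustration with hypothetical function names; it says
nothing about how GAPP devices or dedicated circuitry would perform the
operation.

```python
def corner_turn(words, width):
    """Convert word-serial, bit-parallel data into bit slices.
    Slice b holds bit b (LSB = 0) of every word, one bit per cell."""
    return [[(w >> b) & 1 for w in words] for b in range(width)]

def corner_turn_back(slices):
    """Inverse operation: rebuild the words from their bit slices."""
    n_cells = len(slices[0])
    return [sum(slices[b][c] << b for b in range(len(slices)))
            for c in range(n_cells)]

words = [5, 3, 12, 9]
slices = corner_turn(words, 4)
restored = corner_turn_back(slices)
```

After the turn, each slice can be presented to the array in one step,
with every cell receiving the bit of its own word.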

Control routines for basic functions (compare, write, read)
are written in a register-transfer-like language. From these
basic functions more complex operations can be built, like 'exact
match', 'limit search' and 'maximum value search'. In this
specific implementation, an 8-bit 'exact match' search of the entire
array is performed in 4.4 microseconds, and a 1-bit read of a
responder can be accomplished in 3.2 microseconds.
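
An 'exact match' built from the compare function can be sketched as a
bit-serial loop over the comparand, with the tag bits narrowing the set
of responders. The function names and the list-based model of the cells
are assumptions made for illustration; the actual control routines are
not reproduced here.

```python
def exact_match(cells, comparand, mask, width):
    """Bit-serial exact match. All cells are examined in the same step
    for each bit position. Returns one tag bit per cell."""
    tags = [1] * len(cells)                 # every cell starts as a candidate
    for b in range(width):                  # one pass per bit position
        if not (mask >> b) & 1:
            continue                        # masked-out bits are ignored
        want = (comparand >> b) & 1
        tags = [t & int(((c >> b) & 1) == want) for t, c in zip(tags, cells)]
    return tags

def global_output(tags):
    """Combine the tags into a single responder signal."""
    return int(any(tags))

cells = [0x3A, 0x17, 0x3A, 0xF0]
full = exact_match(cells, 0x3A, 0xFF, 8)
low_nibble = exact_match(cells, 0x07, 0x0F, 8)
```

The cost of the search grows with the word width rather than with the
number of cells, which is the property that makes the associative search
efficient.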

3.1.4  Image  Manipulation 

The GAPP device was originally conceived for the purpose of
image processing so it is not surprising that the most successful
applications of the GAPP are in that domain (Gapp84b).
The GAPP architecture is radically different from conventional
image processing structures. These systems required a frame
buffer to store an image, a high speed image processor, and
another frame buffer for storing the processed image. The
memory-processor transmission bandwidth limits throughput
of conventional systems. GAPP deals with this problem by
providing one processor per pixel. Thus, all pixels can be processed
in parallel. The local processor RAM offers the possibility
of eliminating frame buffers entirely; instead, sufficiently
long serial-to-parallel shift registers on input and output sides
of the processing array can be used. The SIPO registers may
also be built from a row of GAPPs. The shift registers are
long enough to contain one full video line which is shifted into
the GAPP array during the horizontal retrace. Bits of each
pixel are stored during each cycle in the local RAM cells. Each
processor location is read into the CM register prior to each
CM=CMS shift so that the first video line is shifted up and
written into the next higher row of PE's as each new video line
is fed into the bottom row of processing elements. The GAPP
based imaging systems have an ability to process information
in real time.

4 Future of SIMD Processing - Conclusions

We have presented both theoretical and practical aspects of
parallel computing, specifically computing based on SIMD
structures. As technology reaches its physical reduction limits, the
structural changes in computing become more and more crucial.
It is obvious by now that real-time processing of images
and other complex signals cannot be accomplished without
structural concurrency and parallelism.

SIMD machines of the future will have many bit-intensive
features. However, these features will remain transparent
to the user. We are just now discovering the rich coupling
among matrix manipulations for multiplication, especially when
we examine the word- and bit-level operations. McCanny et al.
(CaanSS) have reported that most of the current research effort
has been expended at the word and system level for systolic
arrays. However, they show that the systolic array approach at
the bit level is nearly identical to that at the word level. Most
importantly, they show that many important signal processing
and data processing applications can be implemented using
the regular structure of one or two primitive cells. Andrews
(AndrtSa.b) has further shown that such similarities between
the bit- and word-levels make carry/borrow distances shortened
if special number representations are invoked.
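
How a special number representation shortens carry distances can be
illustrated with a redundant signed-digit adder. The sketch below is an
assumption-laden example, not the representation studied in the cited
work: it uses radix 4 with digits -3..3 and a textbook totally parallel
addition rule, and all names are hypothetical.

```python
RADIX = 4          # assumed radix for this illustration
MAX_DIGIT = 3      # digits range over -3 .. 3

def to_sd(value, n):
    """Encode an integer as n signed digits, least significant first."""
    sign = -1 if value < 0 else 1
    mag = abs(value)
    digits = []
    for _ in range(n):
        digits.append(sign * (mag % RADIX))
        mag //= RADIX
    return digits

def sd_value(digits):
    """Decode signed digits back to an integer."""
    return sum(d * RADIX**i for i, d in enumerate(digits))

def sd_add(x, y):
    """Add two signed-digit numbers. Each result digit depends only on
    digit positions i and i-1, so no carry ripples along the word."""
    s = [a + b for a, b in zip(x, y)]                       # digit sums, -6 .. 6
    t = [1 if v >= MAX_DIGIT else -1 if v <= -MAX_DIGIT else 0 for v in s]
    w = [v - RADIX * c for v, c in zip(s, t)]               # interim sums, -2 .. 2
    return [w[0]] + [w[i] + t[i - 1] for i in range(1, len(s))] + [t[-1]]

total = sd_value(sd_add(to_sd(201, 4), to_sd(-77, 4)))
```

Every output digit is formed from a fixed, small neighborhood of input
digits, so the addition time does not grow with the word length.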

The thrust of these studies shows that we can treat such
mathematically intensive applications at the bit level, capturing
the power of VLSI along the way. Thus maintaining data
flows at the bit level has no significant impediments if
nonconventional number systems are realized. This observation
may even pave the way for TRITs or ternary valued logic. In a
recent paper by Hunt (Hurefid), such multi-valued logic (MVL)
shows great promise for VLSI. His arguments correctly identify
the present limitations of conventional binary logic. First, we
are backing into the packaging thermal limits of VLSI. Second,
a severe escalation of chip interconnection space is occurring.
Some recent estimates indicate that silicon real estate just for
on-chip wiring consumes more than half of the available die.
As a result, we are now judiciously examining the denser
information content of interconnection lines afforded by MVL.

Although at first we are inclined to dismiss bit-level research
as a backwards step, current studies now show that many matrix
manipulations make word-level operations nearly identical
to bit-level operations. From this basic observation, many
researchers have concluded that systolic arrays may have practical
implications sooner than expected. It was surmised that the
primitive cell in the array must be very powerful. Hence, one
immediately anticipates a VLSI structure with several hundred
68000's (or equivalent). Now, we can see that effective pipelining
at the bit level permits us to use bit-level primitive cells in
a computationally powerful machine. The NCR GAPP (Geometric
Arithmetic Parallel Processor) is one such realization.
We contend that it holds promise of many new commercially
available systolic devices to come.


Therefore, multi-valued logic may support new and denser
VLSI structures. If we continue to explore the other number
systems, even greater information densities may be possible.
We can only conclude that the Illiac III may have been way
ahead of its time (it used redundant numbers in the ALU!).

Finally, it can be noted that the existing systolic systems
(like the GAPP) are really the first generation of many new
generations of parallel devices. In addition to obvious increases in
density (number of PE's per wafer) and in computational power,
we can also expect distribution of control functions which are
now handled by one master controller. This distribution of
control will result in systolic devices crossing the line between
SIMD and MIMD concepts, which is also to be expected as the
creation of 'intelligent' MIMD structure is probably one of the
major goals in the design of computing systems.

COMPARATIVE IMPLEMENTATIONS OF THE LMS ALGORITHM

Michael Andrews

Space Tech Corporation, 2224 Manchester Court, Fort Collins, Colorado 80526, U.S.A.

Abstract: A study of available hardware algorithms was made in order to design adaptive signal
processors with VLSI. A suitable model invoking synchrony, topology, and granularity has
been chosen to investigate design figures-of-merit for each implementation. At present,
redundant arithmetic is being contrasted, basically because carry-free operations are possible,
resulting in a speed up. This paper focuses on models and primitive computational elements for
the least-mean-square (LMS) algorithm embedded in conventional twos complement, bit-serial or
distributed arithmetic, and redundant arithmetic processors.


1. INTRODUCTION

A technology-independent architectural study for adaptive signal processors is being
pursued. The basic approach is to analyze the hardware/software trade-offs of
conventional (von Neumann) and nonconventional (parallel, pipeline, vector, array, and
custom processors) in an attempt to identify optimal structures (of order area x time)
which are computationally fast, yet flexible. Regular and simple interconnections and
geometries among high-performance parallel structures are sought. A two-step process
is assumed; first the sequential algorithms are to be speeded up (seeking inherent
parallelism) and second, fast algorithms are to be mapped onto new VLSI architectures
(via recursion and pipelining). The purpose of this research is to provide theoretical
design tools and interconnection strategies capable of achieving real-time
implementation of adaptive algorithms via limited user programmable mechanisms (e.g.
firmware).

In this effort, design rules for establishing implementation techniques which
execute nonrecursive and recursive adaptive algorithms must be stated. The design rules
should identify structures suitable for VLSI. Trade-offs between various hardware
techniques are then possible. This study has particular relevance to the problem of
realizing a minimized system architecture for monolithic adaptive signal processors.
VLSI devices capable of being organized by the proposed rules could possess high
bandwidth, low power attributes in a microminiature configuration to enhance
performance of adaptive antennae arrays, spectrum analyzers, acoustics, cryptography
(adaptive keys), image processing, and communications.

1.1 Opportunity and the challenges

VLSI technology opens unprecedented opportunities for synthesizing complex
computational algorithms from various fields of engineering. It is now possible to
integrate a huge amount of hardware on a small silicon area. However, many traditional
computer design concepts are no longer justified technologically, and this leads to the
formulation of new and challenging problems. A distinct characteristic of VLSI devices
is that the data communication and its VLSI interconnect area dominate the cost and
performance. In contrast, traditional parallel processing finds memories and the
processors dominating other design factors.

2. DESIGNING WITH VLSI

VLSI has now circumscribed classical methodologies of digital system design. The
traditional criteria of component count, whether applied to processors or to simpler
devices, are no longer adequate to establish a scale of comparison among various
solutions to a given problem. Indeed, number-of-elements criteria are substantially
based on the fact that processing elements and their interconnections are realized by
different media. This difference disappears in VLSI, which "integrates" both
processing elements and their interconnection in a two-dimensional geometry, the surface
of the silicon chip. A meaningful figure-of-merit is represented by the area occupied
by the total system, thus capturing the complexity of both computation and data
communication[1]. As a result, the VLSI solution to a given computational problem involves
the conception of an interconnection architecture, its layout, and the design of an
algorithm for that architecture. For any given problem, it is of great interest to explore
the trade-offs between the production cost (area) and the incremental cost (time) of a
dedicated circuit developed to solve that problem. The area of a chip can be partitioned
into interconnect or wire area A(w), gate area A(g), and wire pad area A(p). It
appears, so far, that wire area dominates gate area, at least for the class of transitive
functions (cyclic shifts, matrix product, integer product, and linear transforms).
Vuillemin[2] has shown such VLSI circuits must satisfy

A  >  A(g)N  +  A(p)B  +  A(w)B. 

where A is the chip area, N is the number of inputs, and B is the average bandwidth
(bits per second passing through the chip pins). Interestingly, Rent's rule[3-5] is
partially substantiated here. (A Rent's rule relation states that bandwidth is an increasing
function of area.) Vuillemin's result substantially supports claims that interconnect area
is VLSI expensive. Furthermore, circuits based on a recursive construction are
particularly well suited to automated design.

2.1 The set of VLSI design goals

This research is concerned with the study of algorithms for VLSI arrays, and
focuses on the transformation of sequential/quasirecursive programs into VLSI
algorithms. To do so, it is necessary to define a set of objectives. Although only a partial
list, the following objectives are considered to be important:

1.  Correctness  and  accuracy  of  the  algorithm. 

2. Small computation time; computation time includes processing time and
communication time, but not necessarily control time (which should be transparent).

3. Small number and size of interprocessor communication links attempting to
minimize the excessive interconnect area on current VLSI.

4.  Modularity  and  simplicity  of  cells,  hopefully  very  similar. 

5.  Small  number  of  processing  cells  which  may  achieve  regularity. 

Of course, for specific applications the relative weights of these objectives do vary
depending on many factors including technology, yields, manufacturing limits, die size,
power, and speed. Obviously, the accuracy of the result is a prime concern of the
design. The processing time results from the requirements of the algorithm. Here,
architects often seek trade-offs. In VLSI systems, the communication time is at least
as important as the processing time, because physical distance is relatively long.
Designers, therefore, search for algorithms which are neither computationally nor
communication saturated. Many researchers contend that "balanced" algorithms can be
mapped more easily onto high-performance VLSI architectures. For instance, Kung[6]
has indicated that interprocessor communication links consume a great deal of silicon
area, time, and energy. In that event, it is desirable to have as few links as possible,
and moreover, to restrict the data communication only to adjacent cells which may be
achieved by adequate transformations of algorithms. This goal, however, places a heavy
constraint on the architecture. A higher modularity as well as the simplicity of the cells
lead to a smaller design cost.

A hardware model (G, F, T). Transformation of algorithms can proceed when a
hardware model is specified. A model proposed by Moldovan[7] is useful to transform
the abstract features of the algorithm into the hardware. We assume the following
features for VLSI networks.

1. The network consists of a planar mesh connected network of processing cells.
Nonplanar meshes are still academic matters.

2.  The  cells  can  be  of  different  types  and  perform  different  functions,  but  a  minimal 
set  is  desirable. 

3.  The  interconnections  between  cells  are  buses  which  transfer  parallel  words  and 
represent  a  topology. 

4.  The  timing  operation  of  the  network  is  synchronous. 

Moldovan proposes an organization and operation for VLSI arrays which can be
formally described by a set of 3-tuples (G, F, T). This set appears to be a promising
starting point because it takes into consideration the three dominant design factors
(topology, granularity, and synchrony).

The network geometry G refers to the topology. The position of each processing
cell in a plane is to be described by its Cartesian coordinates. Arbitrarily choosing a
small enough grid makes it feasible to represent these coordinates by integers.
Interconnections between cells are now reasonably described by the positions of
terminal cells. These interconnections support the flow of data through a network of
links. As with all current VLSI structures, a simple and regular geometry is desirable.

The functions F associated with each processing cell represent the granularity of
arithmetic/logic expressions of a cell. Most VLSI implementations assume that each
cell consists of a small number of registers, ALU, and control logic. Several different
types of processing cells might be implemented in the same topology; however, one
reasonable design goal is to reduce the number of cell types and, hence, their
granularity.

The network timing T specifies the processing instant of a cell. As a matter of
synchrony, obviously, a correct timing means that the appropriate data arrives at
destination cells at the correct time. The data stream speed through a network is defined
as the ratio of the distance of a communication link to the communication
time. Often, networks with constant data speeds are preferable solely because they
require less control logic.

2.2  VLSI  Figure-of-merit 

To provide a meaningful gauge for the evaluation of a given design, a computational
model of VLSI has been developed through the initial efforts of Mead and Conway[8]
and Thompson[9]; later refinements have been proposed by Brent and Kung[10] and
Vuillemin[2]. We briefly recall the model with amendments.

A circuit is a graph whose nodes are I/O ports or gates connected by wires. A wire
has minimal width q and, at most, L wires overlap at any point (planarity); a gate has
minimal area and computes a Boolean function of two inputs; an I/O port has some
minimal area and each input bit is available just once. As regards computation time,
the combined gate computation and result transmission (on a wire between two nodes)
takes some time R dependent upon the technology. One relevant global time parameter
is the output period P of the circuit, defined as the maximum time between two
successive data passages at any output port when the circuit is used in a pipelined fashion
at the highest data rate. Another measure is the time T which elapses between the
beginning and the end of one computation by the chip for one instance (rather than
repeated instances) of a given problem. On the basis of arguments on the information
transfer inside the VLSI chip of area A realizing the circuit, natural measures of
complexity in the given model are the area-time products $AP^2$ and $AP$.

If we define the problem size N as the larger of the total number of bits used to
specify either the input or the output, a simple argument by Vuillemin[2] shows that
the circuit area A, period P, and problem size N satisfy the relationships

for such fundamental problems as integer multiplication, merging, cyclic shift, cyclic
and open convolution, and linear transform. Computing time T and period P are
obviously related by $T \ge P$, so the above inequalities imply $AT^2 \ge N$ and $AT \ge N$. Several
of the above-mentioned problems have been considered elsewhere[10, 12, 13], and
circuits have been proposed which are optimal with respect to the $AP^2$ or $AP$ measure.
This study is devoted to seeking design rules for implementing adaptive signal processors
which are optimal with respect to area x time.

3  THE  ALGORITHM  SPACE 

3.1  Adaptive  algorithms 

The theory of adaptive algorithms is relatively well developed and many algorithms
have been proposed[14-35]. However, many of these algorithms are computationally
complex and are really only suitable for non-real-time implementation on digital
computers. In our study, the algorithms to be used should be as simple as possible (to
reduce user-programmability requirements) and also be tolerant of device noise (to
help increase computational speed). This background material is given as a guide to
the eventual selection of a class of adaptive algorithms suitable for real-time processor
implementation.

In general, as a filter, an elemental adaptive signal processor may be viewed as a
system supplied with two inputs, a signal input and a desired output. The signal is
applied to the input of an FIR (finite impulse response) filter with a programmable (time
variable) impulse response. The impulse response of this filter is adjusted in such a
way that the filter output approximates the desired output as closely as possible. IIR
(infinite impulse response) realizations are also possible, in which internal feedforward as
well as feedback signal paths exist. For discussion purposes, we loosely classify the
former as nonrecursive algorithms and the latter as recursive algorithms.

3.1.1 Nonrecursive algorithms. In the elementary case just described, a popular
updating algorithm is the Widrow least-mean-square (LMS) algorithm, which
updates the weight vector H to minimize the mean-square error between some desired
signal d(t) and the filter output y(t). A derivation of the algorithm may be found in
Ref. [16]. Briefly stated, the updated weight vector H' is given by

$$ H' = H + 2\mu\, e(t)\, S, $$

where H is the previous value of the weight vector, S is the signal vector, $\mu$ is a
convergence factor, e(t) = d(t) - y(t) is the error output, and d(t) is the desired or
training signal. Proofs of the convergence of this algorithm assuming perfect device
parameters may also be found in Ref. [36]. A nonrecursive adaptive filter structure is
displayed in Fig. 1.
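The update rule can be illustrated with a short Python sketch. It is a minimal software illustration under stated assumptions, not the hardware formulation studied here: the names `lms_filter` and `fir_output`, the zero initial weights, the zero-padded start-up, and the three-tap test system `unknown` are all hypothetical. The error is taken as e(t) = d(t) - y(t), so the correction term is added to the weights, in agreement with the weight update H = H + μeS used later in the text.

```python
import random


def lms_filter(signal, desired, num_taps, mu):
    """Adapt an FIR filter with the LMS rule H' = H + 2*mu*e(t)*S.

    Returns the final weight vector and the list of error samples e(t).
    """
    weights = [0.0] * num_taps          # assumption: start from zero weights
    errors = []
    for t in range(len(signal)):
        # Signal vector S: the num_taps most recent inputs, zero-padded at start-up.
        s_vec = [signal[t - i] if t - i >= 0 else 0.0 for i in range(num_taps)]
        y = sum(w * s for w, s in zip(weights, s_vec))    # filter output y(t)
        e = desired[t] - y                                # error e(t) = d(t) - y(t)
        weights = [w + 2.0 * mu * e * s for w, s in zip(weights, s_vec)]
        errors.append(e)
    return weights, errors


def fir_output(coeffs, signal):
    """Output of a fixed FIR filter, used here to produce a training signal d(t)."""
    return [sum(c * (signal[t - i] if t - i >= 0 else 0.0)
                for i, c in enumerate(coeffs))
            for t in range(len(signal))]


# Hypothetical identification example: recover a 3-tap system from noise-free data.
random.seed(0)
unknown = [0.5, -0.3, 0.1]
x = [random.uniform(-1.0, 1.0) for _ in range(2000)]
d = fir_output(unknown, x)
w, err = lms_filter(x, d, num_taps=3, mu=0.05)
print([round(v, 4) for v in w])
```

Each iteration costs one inner product for the output and one scaled vector addition for the update, which is the multiply/accumulate workload that the architectures compared later must support.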

Currently, the "sliding window" exact least-squares algorithm (also known as the
sliding window covariance algorithm) is reported to have better tracking properties than the LMS
(gradient) algorithm[37]. Hence, our studies of nonrecursive algorithms to be cast into
VLSI schemes shall range from the LMS to the sliding window formulations.

Fig. 1. A nonrecursive adaptive processor.

3.1.2 Recursive algorithms. The previously described algorithm is one of many
nonrecursive algorithms. Adaptive IIR, or recursive adaptive, formulations also enjoy
a wide selection of algorithms. Popular among these are the:

1. Stearns-White recursive gradient algorithm.

2. Feintuch gradient approximation algorithm.

3. Soderstrand-Gitlin recursive-like echo canceler algorithm.

As with many recursive structures, the presence of poles makes maintaining stability
during adaptation crucial[27]. In addition, whereas the LMS error surface is nicely
quadratic with respect to the weights, the IIR algorithms' corresponding performance
surface is nonquadratic and may contain multiple minima[34].

Let  us  formulate  a  system  identification  problem  as  in  Fig.  2.  The  transfer  functions 
are  defined  as 


A(c)  _  +  ttic  1  +  ■■•  -r  a„z  n 

Wz)  ~  I  - 

Ate)  +  <i|C  1  +  •■•  +  ci„z  " 
Biz)  =  I  -  b,z-'  -  ~  -  b,„z-"‘ 


(II 

(2) 


This adaptive processor attempts to adjust the coefficients of H(z) so that the minimum
mean-square output error $E(e^2)$ is obtained.

Update algorithms for the weights take on the form

$$ H_{k+1} = H_k + M R^{-1} (-\nabla_k), \tag{3} $$

where

H is an adaptive weight vector,

$\nabla_k$ is the performance surface gradient vector,

M is a diagonal matrix of convergence factors,

R is a correlation matrix (elements determined by the selected adaptive algorithm).

The weight vector and gradient vector are, respectively,

The Stearns-White algorithm, first proposed in 1975[32], is a gradient algorithm
similar to the LMS technique. However, the parameter update method uses a recursive
calculation of the gradient. This approach is computationally expensive because of the
complicated nature of the error function e(t) in recursive filters. Here, the instantaneous
output error is used as a local estimate of its own mean value. The adaptive updates
are expressed as

$$ \alpha_n(k) = \frac{\partial y(k)}{\partial a_n} = u(k-n) + \sum_{i=1}^{m} b_i(k)\,\alpha_n(k-i), \tag{6} $$

$$ \beta_m(k) = \frac{\partial y(k)}{\partial b_m} = y(k-m) + \sum_{i=1}^{m} b_i(k)\,\beta_m(k-i), \tag{7} $$

where the instantaneous output error is

$$ e(k) = d(k) - y(k) \tag{8} $$

and the matrix of convergence factors, correlation matrix, and gradient vector
become

$$ M = \mathrm{diag}[\,\mu_0 \cdots \mu_n \;\; \rho_1 \cdots \rho_m\,], \tag{9} $$

$$ R = I = \text{identity matrix}, \tag{10} $$

$$ \nabla_k = e(k)\,[\,\alpha_0(k) \cdots \alpha_n(k) \;\; \beta_1(k) \cdots \beta_m(k)\,]^T. \tag{11} $$

Fig. 2. A system identification problem.

Feintuch[35] proposes a greater simplification to reduce computations per iteration
with a gross approximation of the gradient. This algorithm is based on the assumption
that the covariance terms which arise in taking the expected value of the squared output
error are constants when differentiated with respect to the coefficient vector H. As a
result, the gradient vector can be estimated as

$$ \nabla_k = e(k)\,X^T(k), \tag{12} $$

where

$$ X^T(k) = [\,u_k \cdots u_{k-n} \;\; y_{k-1} \cdots y_{k-m}\,]. \tag{13} $$

The third algorithm is a modification of the Gitlin algorithm[33]. A new error
function is defined to which the LMS algorithm is applied. In all IIR cases, we must
recognize that we are incurring much larger computational costs than with the simple LMS
structure alone. But there are strong results that suggest that fewer filter coefficients
are needed for IIR filters. Therefore, fewer multipliers may be necessary, thus reducing the
computational costs somewhat.

Soderstrand[17] makes a revealing comparison among these choices based on equal
cost implementation. Here, he assumes that the multiplier is the overwhelmingly
expensive element in any physical realization, so his comparisons rest on
implementations with identical numbers of multipliers. Although the results are qualitative,
we can make the following observations on the relative merits of the LMS nonrecursive
algorithm and the three identified above. For 63-multiplier realizations, the LMS
algorithm performs as well as the best recursive choice (Soderstrand-Gitlin).
Unfortunately, we cannot unreservedly choose LMS over recursive methods based on
Soderstrand's work. Even though he measures the performance of the algorithms by how closely
their respective filter parameters converge to the least-mean-square approximations to
five test cases, he has offered no quantitative measures.

3.2  Effectiveness  of  redundant  arithmetic 

Ercegovac[38] has proposed fast computational methods amenable to efficient
hardware-level implementation as viable alternatives to parallel algorithms
and to implementation-dependent algorithms, primarily operating in a fixed-point
number representation system. We can generally classify fast methods as: (1) those
implemented with multiple general-purpose processors and with the corresponding
algorithms, and (2) those implemented with special-purpose processors with algorithms
embedded in the hardware (operation) level.

When  considering  a  computational  method  for  possible  implementation,  we  are 
concerned  with:  (1)  the  application  domain,  (2)  the  required  set  of  algorithms,  and  (3) 
the  required  set  of  primitive  operators.  With  these,  we  can  then  associate  a  set  of 
desired  or  required  properties;  for  instance,  the  speed,  the  complexity,  and  the  cost 
of  implementation,  the  fault  tolerance  capability,  numerical  characteristics  of  the  al¬ 
gorithms,  etc.  The  objective  of  this  study  is  to  define  a  method  that  would  have  a 
sufficient  generality  to  adaptive  signal  processor  applications  and  such  functional  prop¬ 
erties  in  order  to  justify  a  hardware  level  implementation. 

Ercegovac's method is compatible in many respects with the method invoking
multiple processors because the computational algorithm is simple and problem
invariant. There is no shift operator, which in the previously proposed methods must
have a variable shift capability (e.g. the distributed arithmetic schemes of Peled and Liu[39],
Roberts[40], Cowan et al.[41], and others[42-44]). When a redundant representation
is introduced in order to increase the speed of computation, a variable shift operator
can considerably affect the complexity of implementation.

It can be demonstrated that certain arithmetic expressions, multiple products and
sums, inner products, integral powers, and the solution of systems of linear equations under
certain conditions are among the possible applications. Basic arithmetic, in particular
multiplication and division, can be performed by this method. Furthermore, it has a
useful functional property in that the results are generated in a digit-by-digit fashion
with the most significant digits appearing first, so that an overlap of computations can
be utilized. Equally noteworthy is the fact that A/D conversion gives the most significant
digit first, so overlapped signal processing is possible.

3.3 Basic division/multiplication in redundant arithmetic

Let us consider problems of division and multiplication in a computational
environment whereby basic arithmetic algorithms satisfy an "on-line" property. In other
words, to generate the jth digit of the result, it will be necessary and sufficient to have
the operands available up to the $(j + \delta)$th digit, where the index difference $\delta$ is a small
positive constant. At first, we will accumulate $\delta$ initial digits of the operands before
we can produce the first digit of the result. Subsequently, one digit of the result is
produced upon receiving one digit of each of the operands. Remarkably, $\delta$ is the
on-line delay, which can be arbitrarily small. Such algorithms are attractive because of the
inherent speed-up due to their potential to perform an overlapped sequence of
operations. Fast variable-precision arithmetic is also possible. The on-line property will
implement a left-to-right, digit-by-digit type of algorithm using a redundant
representation for the results.

Consider an m-digit radix r number $N = \sum_{i=1}^{m} n_i r^{-i}$. In the conventional
representation, each digit $n_i$ can take any value from the digit set {0, 1, ..., r - 1}. Such
representations, which allow only r values in the digit set, are nonredundant since there
is a unique representation for each (representable) number. By contrast, number
systems that allow more than r values in the digit set are redundant, and often speed up
arithmetic operations[45, 46]. Note that a redundant number representation may be
required for on-line algorithms. In a nonredundant number system, addition and
subtraction incur a carry propagation penalty. Redundancy limits the carry propagation
to one digit position [cf. Ref. [45], an on-line algorithm for addition (and subtraction)
with $\delta = 1$, and an on-line algorithm for multiplication with $\delta = 1$].

3.4  The  computational  algorithm  using  redundant  arithmetic 

Suppose a linear system of L equations is given. An algorithm for solving this system
is desired in which an iterative, digit-by-digit method is used. That is, the algorithm
generates one digit of each of the elements of the solution vector in one step. Some
redundant number representation definitions are now appropriate.

Definition 1

An m-digit radix r representation of a number x, |x| < 1, is a polynomial expansion

$$ x = \operatorname{sgn} x \sum_{i=1}^{m} x_i r^{-i}, \tag{14} $$

where $x_i \in D$ for all i, and D is a digit set.

Definition 2

Given the radix r, a set of consecutive integers D including zero is

(1) nonredundant if its cardinality satisfies |D| = r,

(2) redundant if |D| > r.

Definition 3

A symmetric redundant digit set is defined as

$$ D_p = \{-p,\, -(p-1),\, \ldots,\, -1,\, 0,\, 1,\, \ldots,\, p-1,\, p\}, \tag{15} $$

where

$$ \tfrac{1}{2} r \le p \le r - 1. \tag{16} $$

Definition 4

A symmetric redundant digit set $D_p$ is said to be

(1) minimally redundant if

$$ |D_p| = r + 1, \tag{17} $$

i.e. $p = \tfrac{1}{2} r$ (assuming an even radix r);

(2) maximally redundant if

$$ |D_p| = 2r - 1, \tag{18} $$

i.e. p = r - 1.

Let D and $D_p$ denote nonredundant and redundant digit sets,
respectively. Then the representation of a number x is nonredundant or redundant
depending on whether $x_i \in D$ or $x_i \in D_p$.
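The definitions above can be made concrete with a small Python sketch. It is an illustration only: the function names `symmetric_digit_set`, `is_redundant`, and `representations` are hypothetical, exact rational arithmetic is used for clarity, and the brute-force enumeration is meant for a handful of digits rather than for any practical word length. It shows that a value which has a single expansion over the nonredundant binary digit set acquires several expansions once the digit -1 is allowed.

```python
from fractions import Fraction
from itertools import product


def symmetric_digit_set(radix, p):
    """Digit set D_p = {-p, ..., p}, valid for r/2 <= p <= r - 1 (Definition 3)."""
    if not (radix <= 2 * p and p <= radix - 1):
        raise ValueError("need r/2 <= p <= r - 1")
    return list(range(-p, p + 1))


def is_redundant(digits, radix):
    """Definition 2: a digit set is redundant when it has more than r members."""
    return len(digits) > radix


def representations(value, radix, digits, num_digits):
    """All digit strings (x_1, ..., x_m) whose expansion sum x_i r^-i equals value."""
    found = []
    for candidate in product(digits, repeat=num_digits):
        total = sum(Fraction(x, radix ** (i + 1)) for i, x in enumerate(candidate))
        if total == value:
            found.append(candidate)
    return found


# Compare the nonredundant binary set {0, 1} with the signed-digit set {-1, 0, 1}.
binary_sd = symmetric_digit_set(2, 1)
for digits in ([0, 1], binary_sd):
    reps = representations(Fraction(3, 8), 2, digits, 3)
    print(is_redundant(digits, 2), reps)
```

The freedom to choose among several digit strings for the same value is what lets an adder pick a result digit without waiting for a carry to ripple through the whole word.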

4.  ON-LINE  MULTIPLICATION 

The following describes an on-line algorithm for multiplication which can be made
compatible with on-line division. It is a type of incremental multiplication (using digital
differential analyzers[47, 48]) combined with redundant numbers.

Let

$$ X = \sum_{i=1}^{m} x_i r^{-i}, \tag{19} $$

$$ Y = \sum_{i=1}^{m} y_i r^{-i} \tag{20} $$

be the radix r representations of the positive multiplicand and the multiplier,
respectively. Define

$$ X_j = \sum_{i=1}^{j} x_i r^{-i} = X_{j-1} + x_j r^{-j}, \tag{21} $$

$$ Y_j = \sum_{i=1}^{j} y_i r^{-i} = Y_{j-1} + y_j r^{-j} \tag{22} $$

to be the j-digit representations of the operands X and Y, available at the jth step by
definition of an on-line algorithm. The corresponding partial product is then

$$ X_j Y_j = X_{j-1} Y_{j-1} + (X_j y_j + Y_{j-1} x_j)\, r^{-j}. \tag{23} $$

Let $P_j$ be the scaled partial product, i.e.

$$ P_j = X_j Y_j r^{j}, \tag{24} $$

so that the recursion of the multiplication algorithm can be expressed as

$$ P_j = r P_{j-1} + X_j y_j + Y_{j-1} x_j. \tag{25} $$

Let $P_0 = 0$. Then the scaled product $P_m = XY r^m$ can be generated in m steps of eqn
(25). Note that if a nonredundant number system is used in representing the partial
products, the digits of the desired product appear in a right-to-left fashion, as determined
by the conventional carry propagation requirements. However, for a redundant number
representation, left-to-right generation of the product digits is possible and desirable.
Furthermore, the execution time to perform a recursive step will be independent of
the operand precision because carry-free addition is possible. We now essentially
describe the applicability of Ref. [49] to our current work.

We will use a symmetric redundant digit set, which is

$$ D_p = \{-p,\, -(p-1),\, \ldots,\, -1,\, 0,\, 1,\, \ldots,\, p-1,\, p\}, \tag{26} $$

where

$$ \tfrac{1}{2} r \le p \le r - 1. \tag{27} $$

According to the general computational method of Ref. [50], the basic recursion (25)
for the multiplication is

$$ w_j = r(w_{j-1} - d_{j-1}) + X_j y_j + Y_{j-1} x_j, \tag{28} $$

where the digits $d_j \in D_p$ are determined by the selection function

$$ d_j = S(w_j) = \operatorname{sgn} w_j \left\lfloor |w_j| + \tfrac{1}{2} \right\rfloor. \tag{29} $$

Then, from (25) and (28), the following relation can be obtained by induction:

$$ P_j = \sum_{i=1}^{j-1} d_i r^{j-i} + w_j. \tag{30} $$

Substituting j = m in (30) and rearranging, we obtain

$$ P_m = XY r^m = \sum_{i=1}^{m-1} d_i r^{m-i} + w_m \tag{31} $$

or

$$ XY = \sum_{i=1}^{m} d_i r^{-i} + (w_m - d_m)\, r^{-m}. \tag{32} $$

According to the selection function $S(w_j)$, $|w_m - d_m| \le \tfrac{1}{2}$, so $\sum_{i=1}^{m} d_i r^{-i}$ is now
the redundant representation of the most significant half of the product XY.
Convergence of the algorithm is now guaranteed.
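The recursion and selection function can be exercised with a short Python sketch, which checks the product relation of eqn (32) exactly. This is a minimal sketch under stated assumptions rather than the cell-level implementation: the names `select_digit`, `online_multiply`, and `digits_to_value` are hypothetical, exact rational arithmetic stands in for the hardware registers, $w_0 = d_0 = 0$ is assumed, and the example operands are arbitrary. The sketch does not enforce the operand bounds needed to keep every $d_j$ inside $D_p$.

```python
from fractions import Fraction
import math


def select_digit(w):
    """Selection function (29): d = sgn(w) * floor(|w| + 1/2)."""
    sign = (w > 0) - (w < 0)
    return sign * math.floor(abs(w) + Fraction(1, 2))


def online_multiply(x_digits, y_digits, radix):
    """Run recursion (28) over the digit pairs (x_j, y_j), most significant first.

    Returns the product digits d_1..d_m and the final residual w_m - d_m.
    """
    x_partial = Fraction(0)      # X_j
    y_previous = Fraction(0)     # Y_{j-1}
    w, d = Fraction(0), 0        # assumed start: w_0 = 0, d_0 = 0
    product_digits = []
    for j, (x_j, y_j) in enumerate(zip(x_digits, y_digits), start=1):
        x_partial += Fraction(x_j, radix ** j)
        w = radix * (w - d) + x_partial * y_j + y_previous * x_j    # eqn (28)
        d = select_digit(w)                                         # eqn (29)
        product_digits.append(d)
        y_previous += Fraction(y_j, radix ** j)   # Y_j, used as Y_{j-1} next step
    return product_digits, w - d


def digits_to_value(digits, radix):
    """Value of the fraction with the given digits, most significant first."""
    return sum(Fraction(v, radix ** i) for i, v in enumerate(digits, start=1))


# Hypothetical radix-4 operands with digits drawn from {-2, ..., 2}.
radix = 4
x_digits, y_digits = [0, 1, -2, 1], [0, -1, 2, 2]
d_digits, residual = online_multiply(x_digits, y_digits, radix)
exact = digits_to_value(x_digits, radix) * digits_to_value(y_digits, radix)
approx = digits_to_value(d_digits, radix)
print(d_digits, exact - approx == residual * Fraction(1, radix ** len(x_digits)))
```

Each pass through the loop consumes one digit of each operand and emits one product digit, most significant first, which is the behaviour that allows the multiplier to overlap with the units feeding it and the units it feeds.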


M.  Andrews 



Let us assume that the selection function (29) can produce the digit $d_j$ such that
$|d_j| \le p$. This condition is satisfied provided that

$$ |w_j| < p + \tfrac{1}{2}, \tag{33} $$

where j = 1, 2, ..., m. We next find bounds on the values of the operands X and Y to
ensure that (33) holds. Let |X|, |Y| < M. Noting that $|w_{j-1} - d_{j-1}| \le \tfrac{1}{2}$ and $|y_j|$,
$|x_j| \le p$, from (28) we obtain

$$ |w_j| \le \tfrac{1}{2} r + 2Mp. \tag{34} $$

Rearranging (34) subject to (33), we obtain

$$ M < \frac{2p - r + 1}{4p}. \tag{35} $$

Notice that in a minimally redundant system (p = r/2), the required operand bound
must be

$$ |X|,\ |Y| < \frac{1}{2r}, \tag{36} $$

whereas in a maximally redundant system (p = r - 1), the required operand bound
must be

$$ |X|,\ |Y| < \tfrac{1}{4}. \tag{37} $$

These bounds are not tight. In binary radix, $|X|, |Y| < \tfrac{1}{2}$ suffices.

The time required for the computation of $w_j$ is made independent of the precision
of the corresponding operands by allowing

$$ |w_j - d_j| < 1, \tag{38} $$

which implements a carry-propagation-free addition per (29). This last observation is
most critical to the redundant arithmetic LMS implementation.

4.1 LMS notational conventions

In subsequent discussions, all one-dimensional matrices are represented by column
vectors. Boldfaced characters are vectors or matrices. As usual, the superscript T refers
to the transpose of a column vector or matrix, and primed vectors represent updated
vectors. (The variable t, denoting time, is omitted from subsequent expressions, but
is implicit in the discussion.) The necessary scalars are defined as

h(n) = nth coefficient of an N-point digital transversal filter
f(k) = kth partial product used in the output accumulation of a distributed arithmetic filter
s(n) = input signal sample present at point n of an N-point digital filter
y = digital filter output
d = input training signal to digital adaptive filter
e = d - y = error sample generated by digital adaptive filter

These scalars form the following matrices:

$S^T = [s(1), s(2), \ldots, s(n), \ldots, s(N)]$
$H^T = [h(1), h(2), \ldots, h(n), \ldots, h(N)]$
$F^T = [f(1), f(2), \ldots, f(k), \ldots, f(K)]$
$X^T = [2^{-1}, 2^{-2}, \ldots, 2^{-K}]$, i.e. the set of the first K negative integer powers of 2
B = the N x K array of bit values which results when a K-bit input signal vector is stored in an N-point
digital filter; B is merely S decomposed into component bits.

Consider a set of N registers, each of length K bits, which provides storage for the
N x K array of bit values. This array is represented by the N x K matrix of binary
values B. We then view a classical N-point transversal filter as capturing a signal vector
S containing N K-bit components. The signal vector can be expressed as

$$ S = BX, \tag{39} $$

where it is assumed that the signal is coded in offset binary, wherein logical 0 takes
the value -1. The output of the filter is given by the convolution

$$ y = S^T H, \tag{40} $$

where the column vector H represents the set of N filter coefficients. Define the column
vector F as

$$ F = B^T H; \tag{41} $$

substituting (39) in (40) and using the property of matrix transposition, we have

$$ y = X^T F, \tag{42} $$

where F is a set of partial products.

Cowan et al.[41] have observed that the filter output formulation of (42), when
compared with (40), reveals the essential elements of the distributed arithmetic architecture
of the LMS algorithm depicted in Fig. 3. Simply stated, Fig. 3 is a summation over K
bit planes rather than over N filter points. Only the basic hardware configuration for
a fixed-response distributed arithmetic filter is depicted in Fig. 3. The input ADC
(analog-to-digital converter) signals are presented serially to a set of N cascaded K-bit
shift registers. As this serial bit stream enters the shift registers, the shift register parallel
outputs generate K N-bit address words on the RAM address bus. Each RAM datum
is then right shifted k bits and accumulated. The accumulation is complete after K
memory accesses. Finally, an output sample is converted to an output analog signal.
Since the filter coefficients F are adaptable, the buffer/sum block will generate the
coefficient updates to the RAM.
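The equivalence between summing over filter points, eqn (40), and summing over bit planes, eqn (42), can be checked numerically with a short Python sketch. It is an illustration under stated assumptions: the names `build_table`, `da_output`, and `direct_output` are hypothetical, the RAM is modelled as a list indexed by the N-bit address formed from one bit plane, the offset binary coding maps logical 0 to -1 and logical 1 to +1, and the coefficient and bit values are arbitrary.

```python
from fractions import Fraction


def build_table(coeffs):
    """RAM contents: for every N-bit address, the sum of +h(n) or -h(n).

    Bit n of the address selects the sign (logical 0 takes the value -1).
    """
    table = []
    for addr in range(2 ** len(coeffs)):
        total = Fraction(0)
        for n, h in enumerate(coeffs):
            total += h if (addr >> n) & 1 else -h
        table.append(total)
    return table


def da_output(bit_rows, table):
    """y = X^T F: one table look-up per bit plane, weighted by 2^-(k+1).

    bit_rows[n][k] is bit k (most significant first) of sample n.
    """
    num_points, num_bits = len(bit_rows), len(bit_rows[0])
    y = Fraction(0)
    for k in range(num_bits):
        addr = sum(bit_rows[n][k] << n for n in range(num_points))
        y += table[addr] * Fraction(1, 2 ** (k + 1))   # partial product f(k), shifted
    return y


def direct_output(bit_rows, coeffs):
    """y = S^T H: decode each sample from its bits, then form the inner product."""
    samples = [sum((1 if b else -1) * Fraction(1, 2 ** (k + 1))
                   for k, b in enumerate(row))
               for row in bit_rows]
    return sum(s * h for s, h in zip(samples, coeffs))


# Hypothetical 3-point filter with 4-bit samples.
coeffs = [Fraction(1, 2), Fraction(-1, 4), Fraction(3, 8)]
bit_rows = [[1, 0, 1, 1], [0, 0, 1, 0], [1, 1, 0, 1]]
print(da_output(bit_rows, build_table(coeffs)) == direct_output(bit_rows, coeffs))
```

The table has $2^N$ entries, so the scheme trades multipliers for memory; the output needs only K look-ups, shifts, and additions regardless of the number of filter points.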

4.2  Conventional  binary  ALU 

The LMS algorithm can be implemented in conventional twos complement
arithmetic architectures using hardware intensively or sparsely. Obviously, if the algorithm
is implemented with 2N multipliers and 2N adders (very intensive), no faster
computation speeds can be achieved. The on-line delay is approximately equal to that of
the slowest computational element (adder, multiplier, ADC). In practice, we
realistically expect a trade-off between computation speed and the amount of hardware.

Fig. 3. Distributed arithmetic architecture.

Two cases are considered: case 1 with one adder and one multiplier element, and
case 2 with 2N adders and 2N multipliers (where N is the number of filter coefficients).
In both cases, multipliers and adders process twos complement numbers in parallel
m-bit fashion. The architecture of Fig. 4 is assumed in either case, except that the adders
and multipliers are replicated 2N times for case 2.

Of great interest is the minimum sampling time T of each implementation. This
figure-of-merit is a function of the following processing element periods:

$T_a$ = m-bit, twos complement addition cycle,
$T_w$ = memory write cycle time,
$T_r$ = memory read cycle time,
$T_s$ = one-bit shift,
$T_m$ = m-bit multiplication of two operands.

Assume that the LMS algorithm invokes the equations

$$ y = S^T H \quad \text{(filter output)}, \tag{43} $$

$$ e = d - y \quad \text{(filter error)}, \tag{44} $$

$$ H = H + \mu e S \quad \text{(weight update)}, \tag{45} $$

where upper case and lower case denote vectors and scalars, respectively. The filter
convergence rate is determined by the scalar $\mu$. In practice, it is simplified to $2^K$, K in
{0, -1, -2, -3, ...}. We make the assumption that the analog-to-digital conversion time
$T_{ADC}$ and the digital-to-analog conversion time $T_{DAC}$ are relatively short:

$$ T_{DAC},\ T_{ADC} \ll \text{processing time of slowest computational element}. \tag{46} $$

Fig. 4. Conventional binary architecture.

Let $A_a$, $A_m$, and $A_M$ represent the chip area for the m-bit parallel adder, multiplier,
and memory.

Case 1: One twos complement m-bit parallel adder and one m-bit multiplier.

$$ T = 2N(T_w + T_r + T_m + T_a) + uNT_s + T_a. \tag{47} $$

The area x time figure of merit, FOM, can be no better than

$$ \mathrm{FOM} \ge (A_m + A_a + A_M)\,[\,2N(T_w + T_r + T_m + T_a) + uNT_s + T_a\,]. \tag{48} $$

Case 2: 2N multipliers and 2N m-bit parallel adders. After an initial delay (N + 1
samplings to fill the memory), it is possible to produce a filter output at every sampling
instant. Here, eqn (46) may no longer apply. Even so, signal conversion can be pipelined
with data processing, causing only a slight penalty:

$$ T = T_r + T_m + 3T_a + T_w, \tag{49} $$

$$ \mathrm{FOM} \ge 2N(A_m + A_a + A_M)(T_r + T_m + 3T_a + T_w). \tag{50} $$

The FOMs in both cases are lower limits because no interconnect area is considered.
However, the bound in eqn (48) is much tighter than the bound in eqn (50), simply
because case 2 utilizes 2(N - 1) more computational elements than case 1.
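The two sampling-time expressions and their area x time bounds are easy to tabulate. The Python sketch below is only a calculator for the formulas as read here, with eqn (47) taken as $2N(T_w + T_r + T_m + T_a) + uNT_s + T_a$ and eqn (49) as $T_r + T_m + 3T_a + T_w$. The function names are hypothetical, the parameter `u` is the factor that multiplies $NT_s$ in eqn (47), and every timing and area value in the example is a made-up placeholder rather than a measured figure.

```python
def sampling_time_case1(n, t_w, t_r, t_m, t_a, t_s, u):
    """Eqn (47): one adder and one multiplier shared across all N coefficients."""
    return 2 * n * (t_w + t_r + t_m + t_a) + u * n * t_s + t_a


def sampling_time_case2(t_w, t_r, t_m, t_a):
    """Eqn (49): 2N adders and 2N multipliers working in parallel."""
    return t_r + t_m + 3 * t_a + t_w


def fom_case1(area_mult, area_add, area_mem, sampling_time):
    """Lower bound of eqn (48): total area times sampling time."""
    return (area_mult + area_add + area_mem) * sampling_time


def fom_case2(n, area_mult, area_add, area_mem, sampling_time):
    """Lower bound of eqn (50): area replicated 2N times, times sampling time."""
    return 2 * n * (area_mult + area_add + area_mem) * sampling_time


# Placeholder values in arbitrary units (not taken from any real process).
n = 16
times = dict(t_w=2.0, t_r=2.0, t_m=10.0, t_a=3.0)
t1 = sampling_time_case1(n, t_s=0.5, u=1, **times)
t2 = sampling_time_case2(**times)
print(t1, fom_case1(40.0, 5.0, 20.0, t1))
print(t2, fom_case2(n, 40.0, 5.0, 20.0, t2))
```

In case 1 the sampling time grows linearly with N while the area stays fixed; in case 2 the sampling time is independent of N while the area grows linearly with N. Neither bound includes interconnect area.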

4.3  Bit-sequential  cell 

Sips[44] has proposed a primitive bit-sequential cell and a linear two-dimensional
processing element for addition, subtraction, multiplication, and division, as shown in
Fig. 5. Multiplication is based on the well-known technique of incremental
multiplication[49], in which the MSB is produced first. Figure 5 depicts an individual full adder
(FA) cell and a bitwise floor plan to execute the basic algorithms of the product pairs
AB and CD, essentially using a systolic array with one-bit full adders. Processing begins
in the upper right corner and proceeds downward through the array.

Fig. 5. Bit-sequential cell architecture.

Each cell is comprised of a D flip-flop, a 3-input AND gate, a 2-input XOR gate, and
a full adder. The XOR gate is provided for obtaining the complemented operand (if the
operand has a negative weight). Two operands are provided to each cell via the "a"
and "b" lines. Control lines XTL and XTL' load successively each new bit of the
operand in the next column (to perform the systolic-like addition).

The basic feature of this architecture configured with the bit-sequential cell is that
all operations have the same operation time and that the operation time is linearly
dependent on the number of bits of the operands. The hardware complexity described
in Ref. [51] has been shown to be of O(m) complexity.

4.4  Redundant  arithmetic  cell 

A  digit  slice  proposed  in  Ref.  [51]  can  be  used  as  the  basic  computational  redundant 
arithmetic  cell.  The  cell  is  a  three-level  digit  slice  capable  of  implementing  addition, 
subtraction,  multiplication  [using  eqn  (32)],  and  division.  A  restriction  on  the  digit 
ranges  per  (36)  applies  here.  It  is  important  to  note  that  the  basic  LMS  algorithm 
implemented  with  these  cells  can  operate  on  the  most  significant  digit  first.  Hence,  this 
computational feature makes best use of an analog-to-digital converter which first
produces the most significant digit. In this case, processing and conversion can be
overlapped.

The cell depicted in Fig. 6 assumes that the input digits a, b, and c belong to the digit set D
and that the data outputs consist of a result digit s and three transfers w, t, and y.
Primed digits are intermediate transfers between cells. The cell performs the following
functions.

1. Product of b and c:

$$ ru + w \leftarrow bc, \quad \text{where } r \text{ is the radix}. \tag{51} $$

2. Multidigit addition:

$$ rx + t \leftarrow a + w' + u. \tag{52} $$

Fig. 6. Redundant arithmetic cell architecture.

Comparative implementations of the LMS algorithm


Table I. Comparison of architectural complexity

                     Conventional binary    Distributed       Redundant          Bit-sequential
                     (2N multipliers,       arithmetic        arithmetic cell    cell
                     2N adders)

Gate complexity      [illegible]            O(kN)             O(m)               O(m)

Latency              N + 1 memory           kN bit shifts     one ADC bit        δ + 1 ADC bit
                     writes                                   conversion         conversions

VLSI amenable        structure irregular    yes               yes                yes

Estimated pin        technology             technology        [illegible]        40*
count/cell           sensitive              sensitive

* Assumes each cell is a 4-bit slice.
m = number of digits in a word
k = number of bits in each shift register (k < m)
N = number of filter coefficients
δ = small positive constant (3 or 4) less than T_n, the time to complete a full bit-parallel operation

3. Multidigit subtraction:

\[ rx + t \leftarrow a - w' + u. \tag{53} \]

4. Assimilation (maintenance of signed digit code):

\[ -ry + s \leftarrow a - y', \quad \text{where } y \in \{0, 1\}. \tag{54} \]

The digit slice is configured in a linear one-dimensional array. In left-to-right fashion,
any intercell transfers propagate via the w', t', and y' lines. Results appear at the bottom
of each cell. With the floor plan for the basic adder/multiplier unit, few bridges or vias
are anticipated. Thus, "stray" transistors caused by metallic bridges can be minimized.
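The product and addition functions of the cell can be illustrated with a short sketch. It assumes radix r = 4 and a symmetric digit set {-3, ..., 3}; both are illustrative choices, since the digit-range restriction of (36) is not reproduced here. The names `split_product` and `add_digits` are hypothetical helpers, not part of the cell description in Ref. [51].

```python
# Sketch of the digit-slice product and addition functions, under assumed
# parameters: radix 4 and input digits in {-3, ..., 3}.

RADIX = 4  # assumed radix r


def split_product(b, c, radix=RADIX):
    """Split the product b*c into a transfer u and a digit w with radix*u + w == b*c."""
    p = b * c
    # Adding radix // 2 before floor division keeps w small in magnitude;
    # Python's floor division handles negative products consistently.
    u = (p + radix // 2) // radix
    w = p - radix * u
    return u, w


def add_digits(a, w_in, u, radix=RADIX):
    """Split a + w_in + u into a transfer x and a digit t with radix*x + t == a + w_in + u."""
    s = a + w_in + u
    x = (s + radix // 2) // radix
    t = s - radix * x
    return x, t
```

With these assumptions every product of two input digits splits into a transfer of magnitude at most 2 and a digit in {-2, ..., 1}, so no long carry chain is needed to represent it.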

5  SUMMARY 

Four architectures are compared for their suitability for VLSI implementation of
the LMS algorithm:

1. Conventional two's complement binary full-parallel adder/multipliers.

2. Distributed arithmetic variation of (1) using bitwise adders across the filter taps.

3. Redundant arithmetic cells replacing the adders/multipliers of (1).

4. Bit-sequential arithmetic cells replacing the adders/multipliers of (1).

Table I lists the relative complexity of each architecture. The conventional architecture
has by far the highest gate complexity, but also a very short latency. Latency
is the waiting interval before actual computed results appear at the output. The
conventional architecture is also very general purpose and can support other algorithms
much more conveniently. As expected, increasing hardware tends to expand the
application base of an architecture. However, no analysis of control unit requirements
has been made. A future paper will compare control, communications, and data
transport requirements.

It is important to note that the area-time figure of merit (FOM) for all architectures applied
to the LMS algorithm complies with either eqn (48) or eqn (50). Only the individual values
of A and T will change, depending on the particular technology. Each FOM is very
optimistic since the interconnection area between the PEs has not been considered.
Such important factors are being studied and will be reported in a future paper.

Acknowledgement: This work has been supported in part by ARO Research Grant #DAAG29-83-C-0025.

Abstract: Fast and accurate adaptive filtering implemented in digital hardware is
necessary in many areas of communication and general signal processing. This paper
discusses the realization of the least mean square algorithm for adaptive filtering on
a rectangular systolic array. Two such arrays are needed for a six-coefficient, twelve-bit
adaptive filter. Extension to a larger filter or to higher resolution can be easily
accomplished by cascading processing arrays. The implementation proposed here is oriented
towards the 72-processor array manufactured by the NCR Corporation. Efficient
utilization of the processing array is accomplished by proper feeding of input data arriving
from a sensor array. Processing is performed at the bit level, and the vector operations
of the original LMS algorithm become matrix operations. The processing is divided
into two phases: compute and update. In the first phase a new output of the filter
is computed, and in the next phase the filter coefficients are updated according to the
LMS rule. Effects of the finite word length are discussed, but with a 12-bit representation
of signals these effects are not severe.


1  Introduction 

1.1  Adaptive  Filtering  and  the  LMS  Algorithm 

It is well known that any transmission of information is vulnerable to degradation caused by
various types of interference (noise). This noise may be accidental, like RF interference, or
intentional, like electronic countermeasures. In any case, the reduction of interference is important.

Classical techniques aimed at increasing the signal-to-noise ratio (SNR) usually employ
information derived from a single signal sensor. No additional information is provided, other than the signal
and perhaps statistical properties of both signal and noise. However, the chances of restoring
the original signal can be increased if multiple measurements of the signal wave are available. An
array of sensors provides such an opportunity. Such a sensor array attracted our interest because
it generates a vector of signal components which can be processed in parallel.

J. Walicki is with the Computer Science Department, Colorado State University, Fort Collins, CO 80523;
M. Andrews is with Space Tech Corp., Fort Collins, CO 80526.

This study was sponsored by the U.S. Army Research Office under grant DAAG29-83-C-0025.
The U.S. Government assumes no responsibility for the information presented.

The goal of such processing is well defined: produce a signal as close to the original signal as
possible (allowing for the least mean square error, for example) without prior knowledge of the
signal/noise properties. This goal can be achieved if the processing is self-optimizing, that is,
if processing parameters are continuously modified so that the error criterion mentioned above
is satisfied. An adaptive array is therefore an array of sensor elements plus an adaptive control
algorithm.


The implementation of the adaptive control algorithm can be viewed as a problem of realizing
a normal digital filter with changing filter weights. Several performance measures exist, and the
nature of the adaptive algorithm depends on the selection of a particular measure. Adoption of the
mean square error as the criterion of optimality leads to the elegant Wiener solution for the optimal
weights.


If an array produces an output y(t), obtained by weighting the input vector x(t), then the error
signal is the difference between a reference signal d(t) and the output y(t):

\[ e(t) = d(t) - \mathbf{w}^T \mathbf{x}(t) \tag{1} \]

The expected value of the squared error is then:

\[ E(e^2(t)) = \overline{d^2(t)} - 2\mathbf{w}^T \mathbf{r}_{xd} + \mathbf{w}^T \mathbf{R}_{xx} \mathbf{w} \tag{2} \]

where $\mathbf{R}_{xx}$ is the autocorrelation matrix of the input signal, and $\mathbf{r}_{xd}$ is the
correlation vector of the input and reference signals. If d(t) = s(t) then $\overline{d^2(t)} = S$, which is
the signal power. It can be shown that the value of w which minimizes $E(e^2(t))$ must satisfy the
Wiener-Hopf equation:

\[ \mathbf{w}_{opt} = \mathbf{R}_{xx}^{-1} \mathbf{r}_{xd} \tag{3} \]


For the adaptive array described above the performance measure (MSE) is a quadratic function
of the weights. Therefore the performance measure is a bowl-shaped surface, and the goal of the
adaptive processor is to find the bottom of that bowl. This can be accomplished by any 'hill climbing'
method. The idealized version of 'hill climbing' is the method of steepest descent, which assumes
that the statistics of the signal environment are perfectly known. In many practical situations,
however, the signal statistics are stationary but unknown. For such problems Widrow's Least
Mean Square (LMS) algorithm, described in the next section, is particularly useful (Monz80).
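As a worked step connecting the quadratic error surface to the LMS update, the following sketch assumes the standard mean square error expression with autocorrelation matrix $\mathbf{R}_{xx}$, cross-correlation vector $\mathbf{r}_{xd}$, and step size $\Delta_s$. The instantaneous-gradient substitution in the last line is the usual LMS approximation and is stated here as an assumption rather than derived.

```latex
\begin{align*}
\nabla_{\mathbf{w}} E(e^2)
  &= -2\,\mathbf{r}_{xd} + 2\,\mathbf{R}_{xx}\mathbf{w}
  && \text{(gradient of the quadratic surface)} \\
\nabla_{\mathbf{w}} E(e^2) = 0
  &\;\Rightarrow\; \mathbf{R}_{xx}\mathbf{w}_{opt} = \mathbf{r}_{xd}
  && \text{(bottom of the bowl)} \\
\mathbf{w}(k+1)
  &= \mathbf{w}(k) - \Delta_s \nabla_{\mathbf{w}} E(e^2)
   = \mathbf{w}(k) + 2\Delta_s\bigl(\mathbf{r}_{xd} - \mathbf{R}_{xx}\mathbf{w}(k)\bigr)
  && \text{(steepest descent)} \\
\hat{\nabla}(k)
  &= -2\,e(k)\,\mathbf{x}(k)
  \;\Rightarrow\; \mathbf{w}(k+1) = \mathbf{w}(k) + 2\Delta_s\, e(k)\,\mathbf{x}(k)
  && \text{(LMS estimate)}
\end{align*}
```

The last line needs only the current input vector and error sample, which is why the LMS rule remains usable when the signal statistics are unknown.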


1.2  Implementations  of  Digital  Filters  and  the  LMS  algorithm 


Filtering by means of digital hardware offers many advantages, such as repeatability and
controllability, but also presents a problem of fast and efficient execution, especially if implemented on
a classical SISD digital processor. Attempts at speeding up the processing have been directed at
improving the execution efficiency via nontraditional arithmetic such as the residue number system
or distributed arithmetic (Pele74,Cowa81,Cowa83).


Efficient realization of digital filters aimed at increasing speed and reducing power consumption
has been investigated by Peled (Pele74). He proposed storing a finite number of results of
intermediate arithmetic operations of a filter, and using them to obtain output samples. In that scheme
only additions and shifts are necessary. Elimination of multipliers speeds up and simplifies the
execution. The intermediate results are stored in ROM, which is addressed by a proper combination
of bits of the digitized input signal. Cowan and others (Cowa81,83) proposed an implementation of
an adaptive filter using distributed arithmetic. Again the major increase in execution speed
is obtained by prestoring all possible intermediate results, and the ROM is replaced by RAM
since the updated weights of the adaptive filter affect the prestored intermediate results. Both Peled
and Cowan deal with a single input stream.

Ercegovac (Erce77) proposed a fast computational method which exploits the redundant
number representation system. The method utilizes only fixed-point additions without overflow, and
therefore, no roundoff errors occur. However, errors due to the finite precision representation have
to be handled by extending the working precision. The E-method, as it is called, generates one
digit of each of the elements of the solution vector in one step, starting with the most significant
digits. The main feature of this method is fast evaluation of polynomials, rational functions, and
arithmetic expressions in a fixed-point number representation system. Redundant arithmetic
has been used in the digit slice proposed in (Chow83). The cell is a three-level digit slice capable of
implementing basic arithmetic operations. The digit slice is configured as a linear one-dimensional
array. Results appear at the bottom of each cell. It is worth noting that the basic LMS algorithm
implemented with these cells can operate on the most significant digit first. Hence, this
computational feature makes best use of an analog-to-digital converter which produces the most significant
digit first.

In what follows we propose a fully parallel (SIMD) architecture for the realization of an adaptive
filter array which processes multiple input streams. This approach became feasible with the advent of
processor arrays (also called systolic arrays). This presentation describes the proposed architecture
in general terms. However, the 72-processor array manufactured by NCR directly realizes the
structure presented here.

Section 2 describes the discrete LMS algorithm from the point of view of processor array
implementation, and proposes the basic phases of processing. In Section 3 the implementation of the basic
operation phases is discussed. Section 4 deals with the effects of the finite wordlength. Section 5 offers
a summary and conclusions.


2  Parallel  LMS  algorithm 


An array of sensors generates vectors of samples in response to a signal wavefront entering the
array. This input vector is then fed into the pattern-forming array, where every input sample is
scaled by the appropriate weight factor. The contributions of the individual weighted samples are
summed, yielding a single output sample:

\[ y(k) = \sum_{i=1}^{N} w_i x_i(k) \tag{4} \]

where N is the number of sensor elements. The error sample e(k) is then obtained:

\[ e(k) = d(k) - y(k) \tag{5} \]


The error value is scaled by the convergence factor, and the resulting value is used in the LMS rule
for the coefficient update:

\[ \mathbf{w}(k+1) = \mathbf{w}(k) + 2\Delta_s e(k)\,\mathbf{x}(k) \tag{6} \]

where $\Delta_s$ is a scalar constant controlling the rate of convergence and stability. The equation above
offers a simple prescription for the weight adjustment: the new weight vector is determined by
adding the input signal vector, scaled by the error value, to the old weight vector. It can be shown
that after a sufficient number of iterations the LMS solution converges to the Wiener solution.
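The update rule of equation (6) can be sketched in a few lines of floating-point code. This is a minimal illustration, not the bit-level array implementation developed below; the function names `lms` and `lms_demo`, the three-element weight vector, the uniform random inputs, and the value of the step size are all assumptions made for the example.

```python
import random


def lms(samples, desired, n_weights, delta_s):
    """Run the LMS rule w(k+1) = w(k) + 2*delta_s*e(k)*x(k) over a data record."""
    w = [0.0] * n_weights
    for x, d in zip(samples, desired):
        y = sum(wi * xi for wi, xi in zip(w, x))   # output sample, eq. (4)
        e = d - y                                   # error sample, eq. (5)
        c = 2.0 * delta_s * e                       # scaled error
        w = [wi + c * xi for wi, xi in zip(w, x)]   # weight update, eq. (6)
    return w


def lms_demo(seed=1, n_samples=2000, delta_s=0.05):
    """Identify a hypothetical, noise-free set of reference weights."""
    rng = random.Random(seed)
    w_true = [0.5, -0.25, 0.125]  # assumed target weights, for illustration only
    xs = [[rng.uniform(-1.0, 1.0) for _ in w_true] for _ in range(n_samples)]
    ds = [sum(a * b for a, b in zip(w_true, x)) for x in xs]
    return w_true, lms(xs, ds, len(w_true), delta_s)
```

In this noise-free setting the reference signal is exactly a weighted sum of the inputs, so the adapted weights approach the reference weights as the iterations proceed.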

It  is  possible  to  partition  the  LMS  adaptation  rule  into  the  basic  operations  suitable  for  the 
processor  array  implementation. 

Operation 1: weight input vector, $\mathbf{w}^T\mathbf{x}$

Operation 2: sum partial products to get y(k)

Operation 3: compute error e(k)

Operation 4: scale error, $2\Delta_s e(k) = C$

Operation 5: update weights, $\mathbf{w} \leftarrow \mathbf{w} + C\mathbf{w}$

It can be seen from the above list that the operations can be grouped into two distinct phases:
compute and update. It is worth noting that this grouping is not only a matter of convenience but
also that there exists an underlying reason for such an arrangement. The operation of weighting the
input vector is a convolution-like operation. This fact places specific restrictions upon the system
under consideration. Namely, the system has to be linear and shift (time)-invariant. In practice it
means that the weight coefficients must not change once filter convergence is reached. Therefore,
the computations are performed with constant coefficients which change only during the update
phase.

The basic processing steps listed above require a variety of elementary matrix operations.
Operation 1 requires multiplication of two vectors. Operation 5 calls for multiplication of a vector by
a scalar and for addition of two vectors. Operations 2, 3, and 4 require operations on scalars. The
values to be operated on are represented by k-bit words. Existing processor (systolic) arrays are serial
architectures, mainly because it is still prohibitively expensive to build a fully parallel single-wafer
multi-processor array. Therefore, the operations mentioned above have to be performed at the bit
level. As a result, the vector operations become operations on matrices of binary representations
of signal samples.

Efficient use of a processor array is thus important in order to make up for losses caused by
the use of serial arithmetic. As Urquhart and Wood (Urqu84) show, array utilization depends
on properly feeding in the samples to be processed. In particular, if one matrix of arguments is kept
static on the processor array and the other matrix is entered properly skewed, then the array is
100% efficient. In our case it is quite natural to keep the coefficients w fixed in the array, while
bit-streams of input samples x march in from the array sensors. The arrangement described above is
the basis for the implementation of the first operation, namely formation of the inner product
$\mathbf{w}^T\mathbf{x}$.

The second operation, that of summing partial products in order to obtain the value of the output
sample y(k), is accomplished by using an adder tree. If weight coefficients and input samples interact
in such a way that the least significant bits are processed first, then the output from the adder tree
is a serial stream of bits with the least significant bit leading.

In  the  third  step,  the  error  sample  is  generated  by  subtracting  the  y  stream  from  the  d  stream. 
The  d  stream  is  a  serialized  sample  of  a  reference  signal. 
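Subtraction of two least-significant-bit-first streams can be sketched with a single full adder and a one-bit carry, which is all a bit-serial processing element needs. The sketch assumes two's complement streams of equal length with one extra sign bit; the helper names are hypothetical.

```python
def to_bits_lsb_first(value, width):
    """Two's complement bits of value, least significant bit first."""
    return [(value >> i) & 1 for i in range(width)]


def from_bits_lsb_first(bits):
    """Interpret an LSB-first bit list as a two's complement integer."""
    value = sum(bit << i for i, bit in enumerate(bits))
    if bits[-1]:                      # sign bit set: value is negative
        value -= 1 << len(bits)
    return value


def serial_subtract(d_bits, y_bits):
    """Compute d - y one bit per step as d + (not y) + 1, LSB first."""
    carry = 1                         # the '+1' of the two's complement negation
    out = []
    for d_bit, y_bit in zip(d_bits, y_bits):
        total = d_bit + (1 - y_bit) + carry
        out.append(total & 1)         # sum bit leaves the cell
        carry = total >> 1            # carry is held for the next bit time
    return out
```

For 12-bit unsigned samples a stream width of 13 bits is enough to hold a negative difference, for example `from_bits_lsb_first(serial_subtract(to_bits_lsb_first(1000, 13), to_bits_lsb_first(3000, 13)))`.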

The computed value of the error is scaled in the next operation (4). This is accomplished easily
by shifting e(k) by the appropriate number of places.

The remaining fifth step is the most complex, since it involves multiplying a vector by a scalar
and adding vectors (bit matrices). The first operation in this step multiplies the weight vector
w by the scaled error e(k). Since all weights are scaled by the same scalar value, it merely requires
a simple shifting of the old weights by a number of places prescribed by the value of $2\Delta_s e(k)$. This
operation can be performed in a separate processor array holding a duplicate of the original weights
made just for the purpose of that processing. Or, given a sufficiently powerful processing element (as
we will see in the next section), it is possible to scale the weights in situ. The last step requires us to
add the old weights to the scaled values obtained in the previous operation. Again, given an appropriate
processing element (like the processor array mentioned in the introduction), this operation can
be done in the original array. These steps complete the update phase, and a new input sample is
processed as in the first step.

The next section contains a detailed description of the architecture of the LMS adaptor. This
implementation is general but can be easily mapped onto NCR's 72-processor array.

3  Architecture  of  the  LMS  Processor 

This section describes the overall architecture of the LMS processor. The heart of the system
is a processor array whose main functions are the storage of adaptive coefficients (weights) and
the generation of an output sample y(k).

Without loss of generality let us assume that the input vector is generated by a six-element
sensor array, and that both input samples and weight coefficients are represented as 12-bit quantities.
It can also be assumed that these quantities are positive and less than 1.0. Extension to the two's
complement representation is rather straightforward (Cowa81,McCa84).

These assumptions allow direct use of the 72-processor (6 x 12) array manufactured
by the NCR Corp. The Geometric Arithmetic Parallel Processor (GAPP-II) is a rectangular systolic
array processor chip which can be cascaded in both north/south and east/west orientations. Each
1-bit processor element (PE) can communicate with its four neighbors. Each PE consists of a
bit-serial ALU, a 128 x 1 individually addressable RAM, and 4 single-bit latches. The I/O latch
allows communication through the PE without interrupting the ALU; the remaining latches hold
inputs to the ALU. The GAPP operates as a SIMD machine. That is, instructions are broadcast
to each cell from an external control store, loaded in turn from the host computer. Proper address
sequencing can be provided by any general address sequencer, like the microprogrammable AMD
2910 sequencer. The instructions directed to the processing elements consist of a 13-bit control
field, which specifies the array connectivity and arithmetic/logic operations, and a 7-bit RAM address.
These instructions can be sent to the GAPP array at the rate of 10 MIPS. The array has a global
broadcast of input data and a global output. One GAPP array device can function as a modular
component in a larger array, thus enabling word or bit-length growth of an array as needed. In LMS
applications it may be desirable to increase the number of coefficients, which can be achieved by
simply cascading another array, expanding the filter to 12 coefficients. Likewise, cascading one array
in the direction of the input stream increases the wordlength from 12 to 24 bits (Davi84,Gapp84).

The following diagram schematically shows the operation of the first processor array, which contains
the coefficients in RAM locations and computes $\mathbf{w}^T\mathbf{x}$.

The $w_i^j$ are bits of the digital representation of the weight coefficient $w_i$, which are stored in the
RAM locations of the individual processing elements. Partial products appear on the bottom edge
of the array, represented here by the y bits. Every processing element performs the following basic
operations:

\[ x_{west} \leftarrow x_{east} \]

\[ y_{south} \leftarrow y_{north} + x_{east} \cdot w \quad \text{(accumulated)} \]

If the least significant bits of x and w interact first, then partial products of equal significance
leave the edge of the array at the same time. These partial products can be accumulated using
an adder tree constructed from MSI adders. If, however, the least significant bit $x^0$ interacts with
the most significant bit of w, then the partial products of equal significance appear skewed and
a linear chain of full adders can be used to accumulate the final output. If the second option is
chosen, then a single row of PEs of a second processing array can be used to accumulate the partial
products. In order to handle the operations in the main array properly, guard bands of $\log_2 n$ bits
are appended to the input parallelogram.
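The bit-level formation of the inner product can be checked with a small model. The sketch below is not a cycle-accurate simulation of the array; it only assumes unsigned integer words, forms every single-bit partial product, groups them by significance, and sums the groups, which is the arithmetic an adder tree or adder chain has to carry out. The function name is hypothetical.

```python
def bit_level_inner_product(weights, samples, width=12):
    """Inner product of unsigned integer words formed from single-bit partial products."""
    # columns[s] counts the partial products of significance 2**s
    columns = [0] * (2 * width - 1)
    for w, x in zip(weights, samples):
        for j in range(width):                 # bit j of the stored weight
            w_bit = (w >> j) & 1
            for l in range(width):             # bit l of the incoming sample
                x_bit = (x >> l) & 1
                columns[j + l] += w_bit & x_bit  # one PE: AND of two bits
    # accumulate columns of equal significance into the output word
    return sum(count << s for s, count in enumerate(columns))
```

For six 12-bit coefficients the result needs more than 24 bits, which is one way to see why guard bits are required when the partial products are accumulated.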

The second array implements operations 2, 3, and 4, that is, the computation of the output
sample y(k) and the scaled error $2\Delta_s e(k)$. As mentioned above, the first step is to sum the partial
products arriving from the main processing array. These partial products are accumulated in the
elements of the first row. Because of the additional skewing of the partial products leaving the first
array, a delay is needed between the accumulating PEs. In the GAPP device this can be easily
accomplished by storing the elementary sums in the local RAM locations. The properly accumulated
result y(k) leaves the east processing element and is fed into the third row of processing elements
of the second array (see Fig. 1). This is the end of the compute phase. The remaining processing is
devoted to updating the coefficient weights.

An error value e(k) is computed by subtracting the output sample from the reference sample
d(k), which is shifted into the fourth row of the array simultaneously with the loading operation of
the third row. The result of this operation remains in the fourth row and is shifted according to
the value of $2\Delta_s$. The quantity obtained in this step is the scaled error value, and it is sent to the
host controller.

The controller uses this value to scale (shift) the original coefficients residing in the main
array. However, before this operation is performed the original weights must be saved in the RAM
locations. The scaled weights are also uploaded into the local RAM, and then both quantities are
summed ($\mathbf{w} \leftarrow \mathbf{w} + 2\Delta_s \mathbf{w}$), yielding the new updated coefficients, which are used to compute a new
output sample, and the computation process is repeated.


4  Effects  of  the  Finite  Wordlength 


The derivation of the LMS algorithm presented in the previous sections assumes infinite
precision arithmetic. Under this assumption the only source of inaccuracy is the fundamental
principle of the algorithm, that is, the fact that the gradient which steers towards the optimal solution
cannot be computed accurately but has to be estimated. However, practical implementations
have to deal with the effects of the limited wordlength, especially those that use relatively few bits
(8-16).

Effects of the finite wordlength manifest themselves in various stages of processing, or even
before the actual processing begins, as is the case with the quantization noise. Assuming the 12-bit
representation, it can be shown that the input signal-to-noise ratio is around 70 dB. If the noise is
white and the processing is realized without error, then the mean and the variance of the output
noise are:

\[ m_y = m_x \sum_{n=-\infty}^{\infty} h(n) \tag{7} \]

\[ \sigma_y^2 = \sigma_x^2 \sum_{n=-\infty}^{\infty} |h(n)|^2 \tag{8} \]

where h(n) is the impulse response of the linear filter (Oppe75).
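The order of magnitude of these quantities can be checked numerically. The sketch below assumes a rounding quantizer with step $2^{-b}$ for b-bit fractions, whose error is modeled as uniform with variance $q^2/12$, and a signal uniformly distributed over [0, 1); both are modeling assumptions for the example, and the function names are hypothetical.

```python
import math


def quantization_noise_variance(bits):
    """Variance q**2 / 12 of a uniform rounding error with step q = 2**-bits."""
    q = 2.0 ** -bits
    return q * q / 12.0


def output_noise(mean_in, var_in, h):
    """Mean and variance of white input noise after a filter with impulse response h."""
    mean_out = mean_in * sum(h)                          # eq. (7)
    var_out = var_in * sum(abs(c) ** 2 for c in h)       # eq. (8)
    return mean_out, var_out


def snr_db_uniform(bits):
    """SNR in dB for a signal uniform over [0, 1) quantized to the given number of bits."""
    signal_variance = 1.0 / 12.0
    return 10.0 * math.log10(signal_variance / quantization_noise_variance(bits))
```

Under these assumptions `snr_db_uniform(12)` evaluates to roughly 72 dB, in line with the figure of around 70 dB quoted above.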

In a real implementation of the LMS algorithm there exist two types of noise: the gradient
noise caused by estimating the gradient with a finite amount of input data, and the noise caused by
imprecise results of intermediate processing operations due to the finite wordlength. The operations
of addition and multiplication at each weight adaptation cycle are affected by random errors $e_a$ and
$e_m$. Additionally, the iterative nature of the algorithm causes noise errors to accumulate during
every compute/update cycle:


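\[ \mathbf{w}_{j+1} = \mathbf{w}_j + 2\Delta_s e(j)\,\mathbf{x}_j \tag{9} \]

Let us define

\[ \mathbf{w}'_j = \mathbf{w}_j + \sum_{i=1}^{j} \left( \mathbf{e}_{m_i} + \mathbf{e}_{a_i} \right) \tag{10} \]

Then

\[ \mathbf{w}'_{j+1} = \mathbf{w}'_j + \mathbf{e}_{m_{j+1}} + \mathbf{e}_{a_{j+1}} + 2\Delta_s e(j)\,\mathbf{x}_j \tag{11} \]

It can be safely assumed that the errors $e_a$ and $e_m$ are uncorrelated (Andr77,Andr83). The
computation noise causes deviation of the updated weights from the optimal values $\mathbf{w}_{opt}$, resulting in
the difference

\[ \mathbf{v}'_j = \mathbf{w}'_j - \mathbf{w}_{opt} \tag{12} \]

The j + 1 difference is then

\[ \mathbf{v}'_{j+1} = \mathbf{w}'_j + 2\Delta_s e(j)\,\mathbf{x}_j + \mathbf{e}_{m_{j+1}} + \mathbf{e}_{a_{j+1}} - \mathbf{w}_{opt} \tag{13} \]

And finally, from (12):

\[ \mathbf{v}'_{j+1} = \mathbf{v}'_j + 2\Delta_s e(j)\,\mathbf{x}_j + \mathbf{e}_{m_{j+1}} + \mathbf{e}_{a_{j+1}} \tag{14} \]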


Equation 14 accounts for both the gradient estimation noise and the computational noise.
Computer simulations based on the presented model lead to the following conclusions (Andr78).
The LMS algorithm can be successfully implemented with a short digital word (8-12 bits), but large
values of the convergence factor $2\Delta_s$ are needed to ensure convergence and unbiased behavior of the
adaptive filter. The testbed realization of the 8-bit adaptive filter by Cowan and others (Cowa83)
also confirms these conclusions. Therefore the 12-bit realization of the LMS on the systolic array
should not suffer from the relatively short wordlength.
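The need for a sufficiently large convergence factor with short words can be illustrated by a small experiment. This is a sketch under stated assumptions and not a reproduction of the simulations in (Andr78): weights are rounded to 12 fractional bits after every update, the reference signal is noise-free, and the target weights, input distribution, and the two step values are chosen only for illustration.

```python
import random


def quantize(value, bits):
    """Round value to the nearest multiple of 2**-bits."""
    q = 2.0 ** -bits
    return round(value / q) * q


def lms_fixed(samples, desired, n_weights, step, bits):
    """LMS with weights rounded to a finite wordlength after each update."""
    w = [0.0] * n_weights
    for x, d in zip(samples, desired):
        e = d - sum(wi * xi for wi, xi in zip(w, x))
        # step plays the role of the convergence factor 2*Delta_s
        w = [quantize(wi + step * e * xi, bits) for wi, xi in zip(w, x)]
    return w


def max_weight_error(step, bits=12, n_samples=5000, seed=7):
    """Largest deviation from the (assumed) target weights after adaptation."""
    rng = random.Random(seed)
    w_true = [0.5, -0.25, 0.125]  # hypothetical targets, exactly representable
    xs = [[rng.uniform(-1.0, 1.0) for _ in w_true] for _ in range(n_samples)]
    ds = [sum(a * b for a, b in zip(w_true, x)) for x in xs]
    w = lms_fixed(xs, ds, len(w_true), step, bits)
    return max(abs(a - b) for a, b in zip(w_true, w))
```

With a convergence factor of 0.1 the rounded weights settle close to the targets, whereas with a factor of $2^{-14}$ every correction is smaller than half a least significant bit, so the weights never leave their initial values.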


5  Summary  and  Conclusions 

We have presented a systolic implementation of the LMS algorithm. The proposed
architecture is fully parallel and operates at the bit level, which efficiently exploits the serial representation
of data. Even in the basic structure of two elementary arrays (GAPP devices) the relatively high
resolution of 12 bits is achieved. Cascading more devices allows for an increase in the resolution and
the filter size. Fast and accurate adaptive filtering by means of digital hardware becomes more
important as communication and intelligence needs increase. For example, an array processor
like the one described here may digitally control individual radiating elements of large phased
array radars and at the same time digitally enhance desired signals by eliminating clutter and jamming
signals.

The GAPP architecture is ideally suited to distributed arithmetic implementations of many
signal processing algorithms. Our LMS filter is more demanding than fixed-coefficient filters because
the weights change with each iteration. The local memory of the GAPP array allows storing and
updating the filter coefficients. In conclusion, GAPP devices add a new dimension to parallel
implementations.

Alternative arithmetic engines are being analysed to determine applicability to
adaptive signal processors. Synchrony, topology, and granularity are important
design figures of merit for any VLSI implementation. In this paper, redundant
arithmetic is considered to have the desirable attributes, basically because carry-free
operations are possible, resulting in a speed-up. A primitive computational element
is discussed for the least-mean-square (LMS) algorithm embedded in redundant
arithmetic processors.
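The carry-free property can be made concrete with a sketch of signed-digit addition. It assumes radix 4 with the digit set {-3, ..., 3}, one common choice for which the classical two-step signed-digit addition rule works; the radix, the digit set, and the function names are assumptions for the example and are not taken from the processor described here.

```python
RADIX = 4
MAX_DIGIT = 3  # assumed digit set is {-3, ..., 3}


def sd_add(x_digits, y_digits):
    """Add two signed-digit numbers given as equal-length, least-significant-first lists."""
    n = len(x_digits)
    transfers = [0] * (n + 1)
    interim = [0] * n
    for i in range(n):
        z = x_digits[i] + y_digits[i]
        # The transfer depends only on the two digits at position i.
        if z >= MAX_DIGIT:
            t = 1
        elif z <= -MAX_DIGIT:
            t = -1
        else:
            t = 0
        transfers[i + 1] = t
        interim[i] = z - RADIX * t          # interim digit lies in {-2, ..., 2}
    # Each sum digit uses the interim digit and one incoming transfer only.
    return [interim[i] + transfers[i] for i in range(n)] + [transfers[n]]


def sd_value(digits):
    """Integer value of a least-significant-first signed-digit list."""
    return sum(d * RADIX ** i for i, d in enumerate(digits))
```

Because each result digit depends only on positions i and i - 1, the addition time is independent of the word length, which is the source of the speed-up mentioned above.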

The basic approach has been to analyse the hardware/software trade-offs of
conventional (von Neumann) and non-conventional (parallel, pipeline, vector, array,
and custom) processors in an attempt to identify optimal structures (of order area x
time) which are computationally fast yet flexible. Regular and simple
interconnections and geometries among high-performance parallel structures were
sought. A two-step process can be assumed: first the sequential algorithms are
speeded up (seeking inherent parallelism), and second, fast algorithms are
mapped onto new VLSI architectures (via recursion and pipelining). The current
research is to provide theoretical design tools and interconnection strategies
capable of achieving real-time implementation of adaptive algorithms via limited
user-programmable mechanisms (e.g., firmware).

The traditional criteria of component count, whether applied to processors or
to simpler devices, are no longer adequate to establish a scale of comparison among
various solutions to a given VLSI problem. Indeed, number-of-elements criteria are
substantially based on the fact that processing elements and their interconnections
are realised by different media. This difference disappears in VLSI, which
"integrates" both processing elements and their interconnections in a
two-dimensional geometry. As a result, the VLSI solution to a given computational
problem involves the conception of an interconnection architecture, its layout, and
the design of an algorithm for that architecture. It is useful to examine the
trade-offs between the production cost (area) and the incremental cost (time) of a
dedicated circuit developed to solve that problem. The area of a chip can be
partitioned into interconnect or wire area, A(w), gate area, A(g), and wire pad
area, A(p). At least for the class of transitive functions (cyclic shifts, matrix
product, integer product, and linear transforms), Vuillemin has shown that such VLSI
circuits satisfy

\[ A > A(g)N + A(p)B + A(w)B, \quad \text{where in practice } A(w) \gg A(g), A(p) \]

where A is the chip area, N is the number of inputs, and B is the average bandwidth
(bits per second passing through the chip pins).

A good arithmetic unit for digital signal processing, in addition to the above,
should have the following:

1. The ALU should be modular: the number of modules and the precision of the
operands should not affect the time to perform one arithmetic step such as addition,
subtraction, multiplication, and division.

2. Each digit produced should not depend on very many adjacent digits, so as to
eliminate excessive carry propagation and hardware error checking. "Column"
operations help detect and correct hardware errors independently.

3. Round-off error from truncation should have no bias.

4. The leading digit or sign bit should be processed just like any other digit.

5. Overflow should be detected early. At best, overflow could be detectable when
the first few digits or leading digits are generated.

Points two and five should not be taken lightly. Signal processing applications
frequently require robust implementations which can operate in harsh environments.
Faulty units need to be self-checking and correcting. Furthermore, the wide dynamic
range of signal inputs as found in radar often causes overflow in conventional
two's-complement engines, thereby invoking elaborate dynamic scaling operations.
Detecting the overflow at the onset of the most significant digit production
minimises the scaling overhead and subsequent pipeline flushes.
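
The limited carry propagation asked for in point two can be illustrated with a minimal sketch of signed-digit addition. The sketch assumes radix 16 with maximum digit 10 (the figures quoted later in this section) and uses the standard totally parallel addition rule for signed-digit numbers; the function names sd_add and sd_value are hypothetical and are not taken from the digit slice design itself.

```python
RADIX = 16       # assumed radix (matches the radix-16 slice mentioned in the text)
MAX_DIGIT = 10   # assumed maximum digit magnitude; digits lie in -10..10


def sd_value(digits, radix=RADIX):
    """Integer value of a little-endian list of signed digits."""
    return sum(d * radix**i for i, d in enumerate(digits))


def sd_add(x, y, radix=RADIX, max_digit=MAX_DIGIT):
    """Add two signed-digit numbers (little-endian digit lists).

    Each output digit depends only on the operand digits at its own
    position and at the next lower position, so no carry ripples
    along the word.
    """
    # The rule below requires (radix + 1) / 2 <= max_digit <= radix - 1.
    assert radix + 1 <= 2 * max_digit <= 2 * (radix - 1)
    n = max(len(x), len(y))
    x = x + [0] * (n - len(x))
    y = y + [0] * (n - len(y))

    transfer = [0] * (n + 1)   # transfer[i] enters position i
    interim = [0] * n          # interim sum digit, magnitude <= max_digit - 1
    for i in range(n):
        z = x[i] + y[i]
        if z >= max_digit:
            t = 1
        elif z <= -max_digit:
            t = -1
        else:
            t = 0
        transfer[i + 1] = t
        interim[i] = z - radix * t

    # Final digit: interim sum plus the transfer from the neighbour below.
    return [interim[i] + transfer[i] for i in range(n)] + [transfer[n]]


a = [10, -10, 9]    # value 2154
b = [10, 10, 10]    # value 2730
s = sd_add(a, b)
print(s, sd_value(s))
```

Because every position is computed from a fixed number of neighbouring digits, the addition time in this sketch does not grow with the word length, which is also the property asked for in point one.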

The redundant number digit slice proposed in [1] and cascaded as 4-digit slices
into a module incorporates all of the attributes listed above. Depicted in Fig. 1
is a 15-digit module with three levels. The function of each level is described in
Table 1.


Fig.  1  A  15-Digit  Redundant  Number  Module 


Table 1  Digit Slice Functions

Multiplication      ru + w <- bc
Addition            rx + t <- a + w' + u
                 or rx + t <- a - w' - u
Assimilate          -rb* + a <- a - b*
                 or s <- t' + x


The digit slices can be configured easily into VLSI. The inputs are a, b, and c.
The output is s. Interslice transfers include w, t, and b* for eastward direction
and w', t', and b" for westward direction. All of these are intermediate results
used in adjacent digit positions. The assimilate function assists in the conversion
from redundant to nonredundant number representation. It has already been shown
that a digit slice with 400 gates constructed with 16 cascaded sum-of-products
circuits can implement redundant arithmetic in radix-16 with 10 as the maximum digit
[1]. Current research is exploring the processing delay, latency, and execution
time of the redundant number system processors [2].
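
The conversion from redundant to nonredundant representation mentioned above can be sketched as a borrow ripple over the signed digits. This is a generic illustration under the assumption of radix 16 with digits in -10..10; the function name assimilate is hypothetical, and the code is not claimed to reproduce the exact assimilate function of Table 1.

```python
def assimilate(digits, radix=16):
    """Convert little-endian signed digits to conventional digits 0..radix-1.

    Returns (conventional_digits, borrow_out). A borrow_out of 1 means the
    value is negative and the digits are its radix-complement form, i.e.
    value = sum(d * radix**i) - radix**len(digits).
    """
    out = []
    borrow = 0
    for d in digits:
        v = d - borrow              # apply the borrow from the lower position
        borrow = 1 if v < 0 else 0  # a negative digit borrows from the next one
        out.append(v + radix * borrow)
    return out, borrow


print(assimilate([-4, 1]))   # signed digits for 1*16 - 4 = 12
print(assimilate([-1]))      # a negative value, returned in complement form
```

Unlike signed-digit addition, this step propagates a borrow through every position, which is why it is kept as a separate function applied only when a conventional result is required.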